NYC Yellow Taxi Trip Analysis

How well can we model the number of taxi trips originating in a New York neighborhood in a one-hour period?

Predicting how many taxi trips originate in a certain neighborhood has several benefits:

  • Taxi companies can predict how many taxis should be available in a neighborhood
  • Passengers in a particular neighborhood can know the likelihood of being able to hail a taxi during a certain time of the day

In this project, I use several open datasets to predict the number of taxi trips originating in a specific New York neighborhood during a one-hour period.

The datasets are the following:

  • NYC yellow taxi trips
  • NYC neighborhoods dataset
  • Historic weather data

In [4]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import pandas as pd
import numpy as np
import csv
import sqlite3
import time
import json
from sklearn import cluster
from sklearn import feature_selection, linear_model
from mpl_toolkits.basemap import Basemap
from geopy.distance import vincenty, great_circle
import statsmodels.formula.api as smf
import sklearn
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeRegressor

# set some nicer defaults for matplotlib
from matplotlib import rcParams

#these colors come from colorbrewer2.org. Each is an RGB triplet
dark2_colors = [(0.10588235294117647, 0.6196078431372549, 0.4666666666666667),
                (0.8509803921568627, 0.37254901960784315, 0.00784313725490196),
                (0.4588235294117647, 0.4392156862745098, 0.7019607843137254),
                (0.9058823529411765, 0.1607843137254902, 0.5411764705882353),
                (0.4, 0.6509803921568628, 0.11764705882352941),
                (0.9019607843137255, 0.6705882352941176, 0.00784313725490196),
                (0.6509803921568628, 0.4627450980392157, 0.11372549019607843),
                (0.4, 0.4, 0.4)]

rcParams['figure.figsize'] = (10, 6)
rcParams['figure.dpi'] = 150
rcParams['axes.color_cycle'] = dark2_colors
rcParams['lines.linewidth'] = 2
rcParams['axes.grid'] = False
rcParams['axes.facecolor'] = 'white'
rcParams['font.size'] = 14
rcParams['patch.edgecolor'] = 'none'
C:\Users\Anirban\Anaconda2\lib\site-packages\matplotlib\__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))

Monthly data analysis - Comparison between Yellow taxis and Uber

As part of the exploratory analysis, I first look at how taxi trips vary across the months of the year. I then examine the total number of trips at each hour of the day, the number of trips as a function of the day of the week, and the number of trips on each day of the month.

Here we look at the number of trips recorded by NYC yellow taxis on a monthly basis. Records for 2013, 2014, and 2015 are analyzed. We also look at the number of Uber trips for April through September 2014.

Each count list is created by iterating through all records of the corresponding month's file. Finally, a dataframe containing all the trip count data is created and saved to a CSV file.

It is not necessary to run the next five code blocks, since the resulting dataframe has already been saved to a file that can be loaded directly.
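The per-file counting that `parse_data` performs below can be sketched in isolation; here `io.StringIO` stands in for one monthly trip file:

```python
import csv
import io

def count_records(fileobj):
    """Count the data rows in a CSV file object, skipping the header."""
    reader = csv.reader(fileobj)
    next(reader, None)  # skip the header row
    return sum(1 for _ in reader)

# Tiny in-memory stand-in for a monthly trip file
sample = io.StringIO("medallion,pickup_datetime\nA,2013-01-01\nB,2013-01-02\n")
print(count_records(sample))  # 2
```

Streaming with a generator keeps memory flat even for the ~14 million-row monthly files.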

In [16]:
def parse_data(file_path, file_name, year, taxi_co, count_df):
    '''
    Counts the number of records in each monthly file and adds the counts
    as a new column to an existing dataframe.
    Args:
        file_path (str): Path to the directory containing the files.
        file_name (str): Name of the file without the month number.
        year (int): Year of the acquired data.
        taxi_co (str): Name of the taxi company whose data is being parsed.
        count_df (DataFrame): Existing dataframe to which the counts for the year will be added.
    Returns:
        count_df (DataFrame): Dataframe containing the trip counts for the year.
        
    '''
    count_list = []
    for i in range(12):
        file_path_iter = '%s/%s_%d-%02d.csv' % (file_path, file_name, year, i + 1)
        print file_path_iter

        try:
            with open(file_path_iter, 'r') as trip_data_file:
                trip_data = csv.reader(trip_data_file)
                next(trip_data)  # skip the header row
                count_list.append(sum(1 for _ in trip_data))

        # A missing (or empty) monthly file is recorded as zero trips
        except (IOError, StopIteration):
            count_list.append(0)

    count_df['%d%s_trips' % (year, taxi_co)] = count_list
    return count_df

We create a dataframe containing all the trip count data for 2013, 2014, and 2015 yellow taxi trips.

In [20]:
count_df = pd.DataFrame(columns = ['Month'])
count_df['Month'] = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul','Aug', 'Sep', 'Oct', 'Nov', 'Dec']

new_count_df = count_df

new_count_df = parse_data('../nyc_taxi_files/2013-data','yellow_tripdata',2013,'TLC',new_count_df)
new_count_df = parse_data('../nyc_taxi_files/2014-data','yellow_tripdata',2014,'TLC',new_count_df)
new_count_df = parse_data('../nyc_taxi_files/2015-data','yellow_tripdata',2015,'TLC',new_count_df)

new_count_df
../nyc_taxi_files/2013-data/yellow_tripdata_2013-01.csv
../nyc_taxi_files/2013-data/yellow_tripdata_2013-02.csv
../nyc_taxi_files/2013-data/yellow_tripdata_2013-03.csv
../nyc_taxi_files/2013-data/yellow_tripdata_2013-04.csv
../nyc_taxi_files/2013-data/yellow_tripdata_2013-05.csv
../nyc_taxi_files/2013-data/yellow_tripdata_2013-06.csv
../nyc_taxi_files/2013-data/yellow_tripdata_2013-07.csv
../nyc_taxi_files/2013-data/yellow_tripdata_2013-08.csv
../nyc_taxi_files/2013-data/yellow_tripdata_2013-09.csv
../nyc_taxi_files/2013-data/yellow_tripdata_2013-10.csv
../nyc_taxi_files/2013-data/yellow_tripdata_2013-11.csv
../nyc_taxi_files/2013-data/yellow_tripdata_2013-12.csv
../nyc_taxi_files/2014-data/yellow_tripdata_2014-01.csv
../nyc_taxi_files/2014-data/yellow_tripdata_2014-02.csv
../nyc_taxi_files/2014-data/yellow_tripdata_2014-03.csv
../nyc_taxi_files/2014-data/yellow_tripdata_2014-04.csv
../nyc_taxi_files/2014-data/yellow_tripdata_2014-05.csv
../nyc_taxi_files/2014-data/yellow_tripdata_2014-06.csv
../nyc_taxi_files/2014-data/yellow_tripdata_2014-07.csv
../nyc_taxi_files/2014-data/yellow_tripdata_2014-08.csv
../nyc_taxi_files/2014-data/yellow_tripdata_2014-09.csv
../nyc_taxi_files/2014-data/yellow_tripdata_2014-10.csv
../nyc_taxi_files/2014-data/yellow_tripdata_2014-11.csv
../nyc_taxi_files/2014-data/yellow_tripdata_2014-12.csv
../nyc_taxi_files/2015-data/yellow_tripdata_2015-01.csv
../nyc_taxi_files/2015-data/yellow_tripdata_2015-02.csv
../nyc_taxi_files/2015-data/yellow_tripdata_2015-03.csv
../nyc_taxi_files/2015-data/yellow_tripdata_2015-04.csv
../nyc_taxi_files/2015-data/yellow_tripdata_2015-05.csv
../nyc_taxi_files/2015-data/yellow_tripdata_2015-06.csv
../nyc_taxi_files/2015-data/yellow_tripdata_2015-07.csv
../nyc_taxi_files/2015-data/yellow_tripdata_2015-08.csv
../nyc_taxi_files/2015-data/yellow_tripdata_2015-09.csv
../nyc_taxi_files/2015-data/yellow_tripdata_2015-10.csv
../nyc_taxi_files/2015-data/yellow_tripdata_2015-11.csv
../nyc_taxi_files/2015-data/yellow_tripdata_2015-12.csv
Out[20]:
Month 2013TLC_trips 2014TLC_trips 2015TLC_trips
0 Jan 14776615 13782493 12748986
1 Feb 13990176 13063792 12450521
2 Mar 15749228 15428128 13351609
3 Apr 15100468 14618760 13071789
4 May 15285049 14774042 13158262
5 Jun 14385456 13813030 12324935
6 Jul 13823840 13106366 11562783
7 Aug 12597109 12688878 11130304
8 Sep 14107693 13374017 11225063
9 Oct 15004556 14232488 12315488
10 Nov 14388451 13218217 11312676
11 Dec 13971118 13014162 11460573

We append the April-to-September Uber data to this dataframe. The final dataframe is saved to a CSV file and can be reloaded to run the subsequent analysis.

In [21]:

new_count_df = parse_data('../nyc_taxi_files/uber-trip-data','uber-raw-data',2014,'Uber',new_count_df)

new_count_df
../nyc_taxi_files/uber-trip-data/uber-raw-data_2014-01.csv
../nyc_taxi_files/uber-trip-data/uber-raw-data_2014-02.csv
../nyc_taxi_files/uber-trip-data/uber-raw-data_2014-03.csv
../nyc_taxi_files/uber-trip-data/uber-raw-data_2014-04.csv
../nyc_taxi_files/uber-trip-data/uber-raw-data_2014-05.csv
../nyc_taxi_files/uber-trip-data/uber-raw-data_2014-06.csv
../nyc_taxi_files/uber-trip-data/uber-raw-data_2014-07.csv
../nyc_taxi_files/uber-trip-data/uber-raw-data_2014-08.csv
../nyc_taxi_files/uber-trip-data/uber-raw-data_2014-09.csv
../nyc_taxi_files/uber-trip-data/uber-raw-data_2014-10.csv
../nyc_taxi_files/uber-trip-data/uber-raw-data_2014-11.csv
../nyc_taxi_files/uber-trip-data/uber-raw-data_2014-12.csv
Out[21]:
Month 2013TLC_trips 2014TLC_trips 2015TLC_trips 2014Uber_trips
0 Jan 14776615 13782493 12748986 0
1 Feb 13990176 13063792 12450521 0
2 Mar 15749228 15428128 13351609 0
3 Apr 15100468 14618760 13071789 564516
4 May 15285049 14774042 13158262 652435
5 Jun 14385456 13813030 12324935 663844
6 Jul 13823840 13106366 11562783 796121
7 Aug 12597109 12688878 11130304 829275
8 Sep 14107693 13374017 11225063 1028136
9 Oct 15004556 14232488 12315488 0
10 Nov 14388451 13218217 11312676 0
11 Dec 13971118 13014162 11460573 0
In [22]:
new_count_df.to_csv('../nyc_taxi_files/trip_count_all_years.csv',index=False)
In [23]:
new_count_df = pd.read_csv('../nyc_taxi_files/trip_count_all_years.csv')
new_count_df
Out[23]:
Month 2013TLC_trips 2014TLC_trips 2015TLC_trips 2014Uber_trips
0 Jan 14776615 13782493 12748986 0
1 Feb 13990176 13063792 12450521 0
2 Mar 15749228 15428128 13351609 0
3 Apr 15100468 14618760 13071789 564516
4 May 15285049 14774042 13158262 652435
5 Jun 14385456 13813030 12324935 663844
6 Jul 13823840 13106366 11562783 796121
7 Aug 12597109 12688878 11130304 829275
8 Sep 14107693 13374017 11225063 1028136
9 Oct 15004556 14232488 12315488 0
10 Nov 14388451 13218217 11312676 0
11 Dec 13971118 13014162 11460573 0
In [24]:
plt.figure(figsize=(20,10))

ax1 = plt.subplot(121)
new_count_df.plot(x='Month', y='2013TLC_trips', kind='line', color = 'red', ax=ax1, label='2013 trips')
new_count_df.plot(x='Month', y='2014TLC_trips', kind='line', color = 'blue', ax=ax1, label='2014 trips')
new_count_df.plot(x='Month', y='2015TLC_trips', kind='line', color = 'green', ax=ax1, label='2015 trips')

ax1.set(title='Number of trips each month')

ax2 = plt.subplot(122)
new_count_df[new_count_df.Month.isin(['Apr','May','Jun','Jul','Aug','Sep'])].plot(x='Month', y='2014Uber_trips', kind='line', 
                                                                                  ax=ax2, label='2014 Uber trips')
ax2.set(title='Number of Uber trips from Apr to Sep')
Out[24]:
[<matplotlib.text.Text at 0xaadafd0>]
In [25]:
count_df_sorted = new_count_df.sort_values(by='2013TLC_trips')
count_df_sorted.plot(x='Month', y='2013TLC_trips', kind='bar', ylim=(1.2e7, 1.6e7))
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0xae046d8>

Trip data analysis for a single month

The following code blocks perform two tasks:

  • Read in the January 2013 trip and fare data, which are contained in two different files.
  • Join the two files with a merge on the composite key 'medallion', 'hack_license', 'vendor_id', and 'pickup_datetime'.

The two files contain a one-to-one mapping of records corresponding to the same trips, so a plain inner merge on these keys joins them correctly.
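The multi-key merge can be sketched on toy frames (a reduced key set, for brevity):

```python
import pandas as pd

# Toy stand-ins for the trip and fare files; the real join uses
# 'medallion', 'hack_license', 'vendor_id', and 'pickup_datetime'.
trips = pd.DataFrame({'medallion': ['A', 'B'],
                      'pickup_datetime': ['2013-01-01 10:00:00', '2013-01-01 11:00:00'],
                      'trip_distance': [1.0, 2.5]})
fares = pd.DataFrame({'medallion': ['A', 'B'],
                      'pickup_datetime': ['2013-01-01 10:00:00', '2013-01-01 11:00:00'],
                      'fare_amount': [6.5, 9.0]})

# Inner merge on the composite key pairs each trip with its fare record
merged = pd.merge(trips, fares, on=['medallion', 'pickup_datetime'])
print(merged.shape)  # (2, 4)
```

Because the mapping is one-to-one, the merged frame has exactly as many rows as each input.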

In [5]:
%%time

trip_data_file=pd.read_csv('../nyc_taxi_files/2013-data/yellow_tripdata_2013-01.csv')
trip_fare_file=pd.read_csv('../nyc_taxi_files/2013-data/trip_fare_01.csv')
Wall time: 2min 42s
In [34]:
trip_data_file.columns
Out[34]:
Index([u'medallion', u'hack_license', u'vendor_id', u'rate_code',
       u'store_and_fwd_flag', u'pickup_datetime', u'dropoff_datetime',
       u'passenger_count', u'trip_time_in_secs', u'trip_distance',
       u'pickup_longitude', u'pickup_latitude', u'dropoff_longitude',
       u'dropoff_latitude'],
      dtype='object')
In [35]:
trip_fare_file.columns
Out[35]:
Index([u'medallion', u' hack_license', u' vendor_id', u' pickup_datetime',
       u' payment_type', u' fare_amount', u' surcharge', u' mta_tax',
       u' tip_amount', u' tolls_amount', u' total_amount'],
      dtype='object')
In [7]:
trip_fare_file.rename(columns=lambda x: x.strip(), inplace = True)
In [37]:
trip_fare_file.columns
Out[37]:
Index([u'medallion', u'hack_license', u'vendor_id', u'pickup_datetime',
       u'payment_type', u'fare_amount', u'surcharge', u'mta_tax',
       u'tip_amount', u'tolls_amount', u'total_amount'],
      dtype='object')
In [31]:
trip_data_file.tail()
Out[31]:
medallion hack_license vendor_id rate_code store_and_fwd_flag pickup_datetime dropoff_datetime passenger_count trip_time_in_secs trip_distance pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude
14776610 B33E71CD9E8FE1BE3B70FEB6E807DD15 BAF57796E45D921BB23217E17A372FF6 CMT 1 N 2013-01-06 04:58:23 2013-01-06 05:11:24 1 781 3.3 -73.989029 40.759327 -73.953743 40.770672
14776611 ED160B76D5349C8AC1ECF22CD4B8D538 3B93F6DA5DEBDE9560993FA624C4FF76 CMT 1 N 2013-01-08 14:42:04 2013-01-08 14:50:27 1 503 1.0 -73.993042 40.733990 -73.982483 40.724823
14776612 D83F9AC0E33F6F19869C243BE6AB6FE5 85A55B6772275374EF90AC9457DC1F83 CMT 1 N 2013-01-10 13:29:23 2013-01-10 13:34:45 1 321 0.9 -73.979553 40.785011 -73.968262 40.788158
14776613 04E59442A7DDBCE515E33CD355D866E7 7913172189931A1A1632562B10AB53C4 CMT 1 N 2013-01-06 16:30:15 2013-01-06 16:42:26 1 730 1.3 -73.968002 40.762161 -73.985992 40.770542
14776614 D30BED60331C79E3F7ACD05B325ED42F B5E1D2461A5BCC8819188DACEC17CD69 CMT 1 N 2013-01-05 20:38:46 2013-01-05 20:43:06 1 260 0.8 -73.982224 40.766670 -73.989212 40.773636
In [32]:
trip_fare_file.tail()
Out[32]:
medallion hack_license vendor_id pickup_datetime payment_type fare_amount surcharge mta_tax tip_amount tolls_amount total_amount
14776610 B33E71CD9E8FE1BE3B70FEB6E807DD15 BAF57796E45D921BB23217E17A372FF6 CMT 2013-01-06 04:58:23 CSH 13.0 0.5 0.5 0.0 0.0 14.0
14776611 ED160B76D5349C8AC1ECF22CD4B8D538 3B93F6DA5DEBDE9560993FA624C4FF76 CMT 2013-01-08 14:42:04 CSH 7.5 0.0 0.5 0.0 0.0 8.0
14776612 D83F9AC0E33F6F19869C243BE6AB6FE5 85A55B6772275374EF90AC9457DC1F83 CMT 2013-01-10 13:29:23 CSH 6.0 0.0 0.5 0.0 0.0 6.5
14776613 04E59442A7DDBCE515E33CD355D866E7 7913172189931A1A1632562B10AB53C4 CMT 2013-01-06 16:30:15 CSH 9.5 0.0 0.5 0.0 0.0 10.0
14776614 D30BED60331C79E3F7ACD05B325ED42F B5E1D2461A5BCC8819188DACEC17CD69 CMT 2013-01-05 20:38:46 CSH 5.0 0.5 0.5 0.0 0.0 6.0

I now perform the join on the two files.

Subsequently, I add two columns giving the pickup hour and dropoff hour of each trip. Both columns take values from 0 to 23.
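Applying `time.strptime` row by row works but is slow at this scale; a vectorized alternative (a sketch, not what the notebook actually runs) uses `pd.to_datetime`:

```python
import pandas as pd

df = pd.DataFrame({'pickup_datetime': ['2013-01-01 15:11:48', '2013-01-06 00:18:35']})

# Parse the whole column at once; .dt.hour yields integers in 0-23
df['pickup_hour'] = pd.to_datetime(df['pickup_datetime'],
                                   format='%Y-%m-%d %H:%M:%S').dt.hour
print(df['pickup_hour'].tolist())  # [15, 0]
```

Supplying an explicit `format` avoids per-row format inference, which matters on ~15 million rows.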

In [8]:
%%time

trip_full_df=pd.DataFrame.merge(trip_data_file, trip_fare_file, 
                             on=['medallion','hack_license','vendor_id','pickup_datetime'])
Wall time: 11min 14s
In [19]:
%%time

trip_full_df['pickup_hour']=trip_full_df['pickup_datetime'].apply(lambda x: time.strptime(x, "%Y-%m-%d %H:%M:%S").tm_hour)
trip_full_df['dropoff_hour']=trip_full_df['dropoff_datetime'].apply(lambda x: time.strptime(x, "%Y-%m-%d %H:%M:%S").tm_hour)
Wall time: 8min 27s
Compiler : 126 ms
In [20]:
trip_full_df.shape
Out[20]:
(14776615, 23)
In [21]:
trip_full_df.head()
Out[21]:
medallion hack_license vendor_id rate_code store_and_fwd_flag pickup_datetime dropoff_datetime passenger_count trip_time_in_secs trip_distance ... dropoff_latitude payment_type fare_amount surcharge mta_tax tip_amount tolls_amount total_amount pickup_hour dropoff_hour
0 89D227B655E5C82AECF13C3F540D4CF4 BA96DE419E711691B9445D6A6307C170 CMT 1 N 2013-01-01 15:11:48 2013-01-01 15:18:10 4 382 1.0 ... 40.751171 CSH 6.5 0.0 0.5 0.0 0.0 7.0 15 15
1 0BD7C8F5BA12B88E0B67BED28BEA73D8 9FD8F69F0804BDB5549F40E9DA1BE472 CMT 1 N 2013-01-06 00:18:35 2013-01-06 00:22:54 1 259 1.5 ... 40.750660 CSH 6.0 0.5 0.5 0.0 0.0 7.0 0 0
2 0BD7C8F5BA12B88E0B67BED28BEA73D8 9FD8F69F0804BDB5549F40E9DA1BE472 CMT 1 N 2013-01-05 18:49:41 2013-01-05 18:54:23 1 282 1.1 ... 40.726002 CSH 5.5 1.0 0.5 0.0 0.0 7.0 18 18
3 DFD2202EE08F7A8DC9A57B02ACB81FE2 51EE87E3205C985EF8431D850C786310 CMT 1 N 2013-01-07 23:54:15 2013-01-07 23:58:20 2 244 0.7 ... 40.759388 CSH 5.0 0.5 0.5 0.0 0.0 6.0 23 23
4 DFD2202EE08F7A8DC9A57B02ACB81FE2 51EE87E3205C985EF8431D850C786310 CMT 1 N 2013-01-07 23:25:03 2013-01-07 23:34:24 1 560 2.1 ... 40.747868 CSH 9.5 0.5 0.5 0.0 0.0 10.5 23 23

5 rows × 23 columns

Exploratory analysis of trip counts

I now perform some exploratory analysis and visualization. I look at how the number of trips varies by hour of day, by day of week, and, at a more granular level, for each day of the month.

I also look at how many trips are made per cab in a given month. Each cab can be identified by its medallion number (individual drivers are identified by the hack_license column).

In [22]:
%%time

# The fare amount is converted to a float datatype
# An additional tip fraction column is added for analysis of tips which will be performed later

trip_full_df['fare_amount'] = trip_full_df['fare_amount'].astype(float)
trip_full_df['tip_fraction'] = trip_full_df['tip_amount']/trip_full_df['fare_amount']
Wall time: 2.09 s
Compiler : 104 ms
In [23]:
# By grouping by the pickup hour, and then plotting a histogram with 24 bins, we can look at the distribution of trips on an hourly basis.

trip_full_df.pickup_hour.groupby(trip_full_df.pickup_hour.values).size().plot(kind='bar')
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0xc7de7470>
In [24]:
%%time

# Adding a new column for the day of the week. I do this by first extracting the wday number and then mapping each day number 
# to a day of the week

trip_full_df['pickup_wday'] = trip_full_df['pickup_datetime'].apply(lambda x: time.strptime(x, "%Y-%m-%d %H:%M:%S").tm_wday)
trip_full_df['pickup_day'] = trip_full_df['pickup_wday'].map({0:'Monday', 1:'Tuesday',2:'Wednesday', 3:'Thursday', 4:'Friday', 5:'Saturday', 6:'Sunday'})
Wall time: 5min 22s

In order to look at the number of trips for each weekday, I group by the pickup_day column and count the trips. It should be noted, however, that this bar chart does not give an accurate picture of trips as a function of the day of the week, since in a given month some weekdays occur four times while others occur five times.
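The bias is easy to correct: divide each weekday's trip count by the number of times that weekday occurs in the month. A sketch with toy numbers (January 2013 has five Tuesdays but only four Sundays):

```python
import pandas as pd

# Hypothetical totals for two weekdays in a month
trips_per_day = pd.Series({'Tuesday': 500, 'Sunday': 300})

# How many times each weekday occurs in that month
occurrences = pd.Series({'Tuesday': 5, 'Sunday': 4})

# Average trips per occurrence removes the 4-vs-5 occurrence bias
avg_trips = trips_per_day / occurrences
print(avg_trips['Tuesday'], avg_trips['Sunday'])  # 100.0 75.0
```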

In [25]:
trip_full_df['pickup_day'].groupby(trip_full_df.pickup_day.values).size().plot(kind='bar')
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x82772e48>
In [26]:
%%time

# I extract the trip date from the pickup_datetime column and assign it to a new column
trip_full_df['full_date'] = trip_full_df['pickup_datetime'].apply(lambda x: x.split(" ")[0])
Wall time: 14.2 s
In [27]:
%%time

# I now introduce a new column for day of the month and look at the number of trips for each day.

trip_full_df['month_day'] = trip_full_df['pickup_datetime'].apply(lambda x: time.strptime(x, "%Y-%m-%d %H:%M:%S").tm_mday)
trip_full_df['month_day'].groupby(trip_full_df.month_day.values).size().plot(kind='bar')
# trip_pickup_date.groupby(trip_pickup_date).size().plot(kind='bar')
Wall time: 3min 54s
In [28]:
trip_full_df.head()
Out[28]:
medallion hack_license vendor_id rate_code store_and_fwd_flag pickup_datetime dropoff_datetime passenger_count trip_time_in_secs trip_distance ... tip_amount tolls_amount total_amount pickup_hour dropoff_hour tip_fraction pickup_wday pickup_day full_date month_day
0 89D227B655E5C82AECF13C3F540D4CF4 BA96DE419E711691B9445D6A6307C170 CMT 1 N 2013-01-01 15:11:48 2013-01-01 15:18:10 4 382 1.0 ... 0.0 0.0 7.0 15 15 0.0 1 Tuesday 2013-01-01 1
1 0BD7C8F5BA12B88E0B67BED28BEA73D8 9FD8F69F0804BDB5549F40E9DA1BE472 CMT 1 N 2013-01-06 00:18:35 2013-01-06 00:22:54 1 259 1.5 ... 0.0 0.0 7.0 0 0 0.0 6 Sunday 2013-01-06 6
2 0BD7C8F5BA12B88E0B67BED28BEA73D8 9FD8F69F0804BDB5549F40E9DA1BE472 CMT 1 N 2013-01-05 18:49:41 2013-01-05 18:54:23 1 282 1.1 ... 0.0 0.0 7.0 18 18 0.0 5 Saturday 2013-01-05 5
3 DFD2202EE08F7A8DC9A57B02ACB81FE2 51EE87E3205C985EF8431D850C786310 CMT 1 N 2013-01-07 23:54:15 2013-01-07 23:58:20 2 244 0.7 ... 0.0 0.0 6.0 23 23 0.0 0 Monday 2013-01-07 7
4 DFD2202EE08F7A8DC9A57B02ACB81FE2 51EE87E3205C985EF8431D850C786310 CMT 1 N 2013-01-07 23:25:03 2013-01-07 23:34:24 1 560 2.1 ... 0.0 0.0 10.5 23 23 0.0 0 Monday 2013-01-07 7

5 rows × 28 columns

In [29]:
%%time

# Looking at the distribution of trips on a particular day. I choose Jan 21 since it had the fewest trips.
tripdata_jan21 = trip_full_df[trip_full_df.full_date == '2013-01-21']
Wall time: 12min 55s
Parser   : 136 ms
In [30]:
tripdata_jan21['pickup_hour'].groupby(tripdata_jan21['pickup_hour'].values).size().plot(kind='bar')
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x12dbaac8>
In [31]:
%%time

# Comparing trips on Jan 13 (Sunday) with Jan 16 (Wednesday)
tripdata_jan13 = trip_full_df[trip_full_df.full_date == '2013-01-13']
tripdata_jan16 = trip_full_df[trip_full_df.full_date == '2013-01-16']
Wall time: 3.67 s
In [32]:
tripdata_jan13['pickup_hour'].groupby(tripdata_jan13['pickup_hour'].values).size().plot(kind='bar')
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x47c7d940>
In [33]:
tripdata_jan16['pickup_hour'].groupby(tripdata_jan16['pickup_hour'].values).size().plot(kind='bar')
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x17687278>

As we can see, Jan 13, a Sunday, shows a very different hourly pattern from Jan 16, a Wednesday.

I also wanted to see how many trips a single cab typically makes in the month. To do this, I plot the distribution of trip counts after grouping by medallion number.

In [34]:
trip_full_df.groupby(['medallion']).size().hist(bins=50)
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x70fb6240>
In [35]:
trip_full_df.describe()
C:\Users\Anirban\Anaconda2\lib\site-packages\numpy\lib\function_base.py:3403: RuntimeWarning: Invalid value encountered in median
  RuntimeWarning)
Out[35]:
rate_code passenger_count trip_time_in_secs trip_distance pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude fare_amount surcharge mta_tax tip_amount tolls_amount total_amount pickup_hour dropoff_hour tip_fraction pickup_wday month_day
count 1.477662e+07 1.477662e+07 1.477662e+07 1.477662e+07 1.477662e+07 1.477662e+07 1.477653e+07 1.477653e+07 1.477662e+07 1.477662e+07 1.477662e+07 1.477662e+07 1.477662e+07 1.477662e+07 1.477662e+07 1.477662e+07 1.477662e+07 1.477662e+07 1.477662e+07
mean 1.034273e+00 1.697372e+00 6.834236e+02 2.770976e+00 -7.263634e+01 4.001440e+01 -7.259443e+01 3.999219e+01 1.166472e+01 3.204904e-01 4.983970e-01 1.267509e+00 2.018670e-01 1.395298e+01 1.358165e+01 1.358404e+01 1.064043e-01 2.962503e+00 1.644678e+01
std 3.387715e-01 1.365396e+00 4.944063e+02 3.305923e+00 1.013819e+01 7.789904e+00 1.028860e+01 7.537067e+00 9.639219e+00 3.675741e-01 2.826541e-02 2.046084e+00 1.035481e+00 1.146469e+01 6.401632e+00 6.461708e+00 1.652766e-01 1.895188e+00 8.838428e+00
min 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 -2.771285e+03 -3.547921e+03 -2.350956e+03 -3.547921e+03 2.500000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 2.500000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000e+00
25% 1.000000e+00 1.000000e+00 3.600000e+02 1.000000e+00 -7.399188e+01 4.073551e+01 NaN NaN 6.500000e+00 0.000000e+00 5.000000e-01 0.000000e+00 0.000000e+00 7.700000e+00 9.000000e+00 9.000000e+00 0.000000e+00 1.000000e+00 9.000000e+00
50% 1.000000e+00 1.000000e+00 5.540000e+02 1.700000e+00 -7.398166e+01 4.075315e+01 NaN NaN 9.000000e+00 0.000000e+00 5.000000e-01 8.000000e-01 0.000000e+00 1.050000e+01 1.400000e+01 1.400000e+01 7.142857e-02 3.000000e+00 1.700000e+01
75% 1.000000e+00 2.000000e+00 8.850000e+02 3.060000e+00 -7.396684e+01 4.076729e+01 NaN NaN 1.300000e+01 5.000000e-01 5.000000e-01 2.000000e+00 0.000000e+00 1.550000e+01 1.900000e+01 1.900000e+01 2.095238e-01 5.000000e+00 2.400000e+01
max 2.100000e+02 2.550000e+02 1.080000e+04 1.000000e+02 1.124042e+02 3.310365e+03 2.228738e+03 3.477106e+03 5.000000e+02 1.250000e+01 5.000000e-01 2.000000e+02 2.000000e+01 6.500000e+02 2.300000e+01 2.300000e+01 7.880000e+01 6.000000e+00 3.100000e+01

A few things stand out. The minimum trip distance is 0; any trip with zero distance should be filtered out during analysis and modeling. The dropoff latitude and longitude columns contain some NaN values, and both coordinate columns contain implausible values (e.g., the minimum pickup longitude is about -2771).
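These cleanup rules can be combined into a single boolean mask; a sketch on toy rows:

```python
import pandas as pd

df = pd.DataFrame({'trip_distance': [1.0, 0.0, 2.5],
                   'pickup_latitude': [40.75, 40.76, 3310.0],
                   'pickup_longitude': [-73.98, -73.97, -2771.0]})

# Keep trips with a positive distance and plausible NYC pickup coordinates
mask = (df['trip_distance'] > 0) \
       & df['pickup_latitude'].between(40.7, 40.8) \
       & df['pickup_longitude'].between(-74.05, -73.8)
clean = df[mask]
print(len(clean))  # 1
```

Building one mask and filtering once avoids scanning the 14.7 million-row frame repeatedly.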

In [36]:
%%time

trip_df_nonzero = trip_full_df[trip_full_df.trip_distance.astype(float) != 0]
Wall time: 6min 32s
Parser   : 138 ms
In [37]:
print trip_df_nonzero.shape
trip_df_nonzero.head()
(14693239, 28)
Out[37]:
medallion hack_license vendor_id rate_code store_and_fwd_flag pickup_datetime dropoff_datetime passenger_count trip_time_in_secs trip_distance ... tip_amount tolls_amount total_amount pickup_hour dropoff_hour tip_fraction pickup_wday pickup_day full_date month_day
0 89D227B655E5C82AECF13C3F540D4CF4 BA96DE419E711691B9445D6A6307C170 CMT 1 N 2013-01-01 15:11:48 2013-01-01 15:18:10 4 382 1.0 ... 0.0 0.0 7.0 15 15 0.0 1 Tuesday 2013-01-01 1
1 0BD7C8F5BA12B88E0B67BED28BEA73D8 9FD8F69F0804BDB5549F40E9DA1BE472 CMT 1 N 2013-01-06 00:18:35 2013-01-06 00:22:54 1 259 1.5 ... 0.0 0.0 7.0 0 0 0.0 6 Sunday 2013-01-06 6
2 0BD7C8F5BA12B88E0B67BED28BEA73D8 9FD8F69F0804BDB5549F40E9DA1BE472 CMT 1 N 2013-01-05 18:49:41 2013-01-05 18:54:23 1 282 1.1 ... 0.0 0.0 7.0 18 18 0.0 5 Saturday 2013-01-05 5
3 DFD2202EE08F7A8DC9A57B02ACB81FE2 51EE87E3205C985EF8431D850C786310 CMT 1 N 2013-01-07 23:54:15 2013-01-07 23:58:20 2 244 0.7 ... 0.0 0.0 6.0 23 23 0.0 0 Monday 2013-01-07 7
4 DFD2202EE08F7A8DC9A57B02ACB81FE2 51EE87E3205C985EF8431D850C786310 CMT 1 N 2013-01-07 23:25:03 2013-01-07 23:34:24 1 560 2.1 ... 0.0 0.0 10.5 23 23 0.0 0 Monday 2013-01-07 7

5 rows × 28 columns

From a histogram of all latitude and longitude values (shown later as part of the visualization), it was observed that most taxi trips take place between latitudes 40.7 and 40.8 and longitudes -74.05 and -73.8.

In [38]:
%%time

# Converting all location data to floats for easier analysis
trip_df_nonzero[['pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude']] = \
trip_df_nonzero[['pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude']].astype(float)

# filtering the records belonging to relevant latitudes and longitudes
trips_with_location = trip_df_nonzero[(trip_df_nonzero.pickup_latitude >= 40.7) & \
                                                        (trip_df_nonzero.pickup_latitude <= 40.8) &\
                (trip_df_nonzero.pickup_longitude >= -74.05) & (trip_df_nonzero.pickup_longitude <= -73.8)]
C:\Users\Anirban\Anaconda2\lib\site-packages\pandas\core\frame.py:2378: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[k1] = value[k2]
Wall time: 10min 57s
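The SettingWithCopyWarning above arises because the assignment targets a frame that is itself a slice of trip_full_df; taking an explicit `.copy()` after slicing (a minimal sketch on toy data) avoids it:

```python
import pandas as pd

df = pd.DataFrame({'trip_distance': ['1.0', '0.0'],
                   'pickup_latitude': ['40.75', '40.76']})

# Slice, then copy, so later column assignments target an independent frame
nonzero = df[df['trip_distance'].astype(float) != 0].copy()
nonzero['pickup_latitude'] = nonzero['pickup_latitude'].astype(float)
print(nonzero.shape)  # (1, 2)
```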
In [39]:
print trips_with_location.shape
trips_with_location.head()
(13754630, 28)
Out[39]:
medallion hack_license vendor_id rate_code store_and_fwd_flag pickup_datetime dropoff_datetime passenger_count trip_time_in_secs trip_distance ... tip_amount tolls_amount total_amount pickup_hour dropoff_hour tip_fraction pickup_wday pickup_day full_date month_day
0 89D227B655E5C82AECF13C3F540D4CF4 BA96DE419E711691B9445D6A6307C170 CMT 1 N 2013-01-01 15:11:48 2013-01-01 15:18:10 4 382 1.0 ... 0.0 0.0 7.0 15 15 0.0 1 Tuesday 2013-01-01 1
1 0BD7C8F5BA12B88E0B67BED28BEA73D8 9FD8F69F0804BDB5549F40E9DA1BE472 CMT 1 N 2013-01-06 00:18:35 2013-01-06 00:22:54 1 259 1.5 ... 0.0 0.0 7.0 0 0 0.0 6 Sunday 2013-01-06 6
2 0BD7C8F5BA12B88E0B67BED28BEA73D8 9FD8F69F0804BDB5549F40E9DA1BE472 CMT 1 N 2013-01-05 18:49:41 2013-01-05 18:54:23 1 282 1.1 ... 0.0 0.0 7.0 18 18 0.0 5 Saturday 2013-01-05 5
3 DFD2202EE08F7A8DC9A57B02ACB81FE2 51EE87E3205C985EF8431D850C786310 CMT 1 N 2013-01-07 23:54:15 2013-01-07 23:58:20 2 244 0.7 ... 0.0 0.0 6.0 23 23 0.0 0 Monday 2013-01-07 7
4 DFD2202EE08F7A8DC9A57B02ACB81FE2 51EE87E3205C985EF8431D850C786310 CMT 1 N 2013-01-07 23:25:03 2013-01-07 23:34:24 1 560 2.1 ... 0.0 0.0 10.5 23 23 0.0 0 Monday 2013-01-07 7

5 rows × 28 columns

Data Munging and analysis with a subset

We extract a sample from the overall dataset. Later, we will perform some feature engineering on this subset as an experimental step before running the same feature engineering on the entire dataset. Note that the sample is drawn only from trips within the latitude and longitude ranges where most taxi trips take place.

The sample data has been stored in a separate file so that the sampling code does not have to be run each time.

In [40]:
%%time

sample_trips_with_location = trips_with_location.sample(n=500000, random_state=1)
Wall time: 25min 1s
Parser   : 679 ms
In [41]:
sample_trips_with_location.head()
Out[41]:
medallion hack_license vendor_id rate_code store_and_fwd_flag pickup_datetime dropoff_datetime passenger_count trip_time_in_secs trip_distance ... tip_amount tolls_amount total_amount pickup_hour dropoff_hour tip_fraction pickup_wday pickup_day full_date month_day
10398690 F47E83A52C8401C874B99A98CCE246DF 4E260C3310346E735CBDF8C8E6761D66 VTS 1 NaN 2013-01-30 16:34:00 2013-01-30 16:43:00 1 540 0.86 ... 0.0 0.0 7.5 16 16 0.000000 2 Wednesday 2013-01-30 30
7409709 3AA58DEEC2CA4C104FCA781EC38B6DA1 8F986F92BDA7196564FAA2766EFCD3AD CMT 1 N 2013-01-16 22:31:15 2013-01-16 22:37:24 1 368 1.20 ... 0.0 0.0 7.5 22 22 0.000000 2 Wednesday 2013-01-16 16
10837498 BEEB829C76993A19098AFA1F5C0C88D2 1C3DFBBA24C7057BF7CB1E48E28153F0 CMT 1 N 2013-01-28 14:34:48 2013-01-28 14:47:36 1 767 1.90 ... 2.2 0.0 13.2 14 14 0.209524 0 Monday 2013-01-28 28
5156833 3E7AA144EBE43CE6B567524550C78DFF 4AEEA5836012EA84DF0E246D002E1A41 VTS 1 NaN 2013-01-18 18:43:00 2013-01-18 18:50:00 5 420 1.22 ... 1.6 0.0 10.1 18 18 0.228571 4 Friday 2013-01-18 18
9110142 A5C6F0FC150417517C2894400795B1F8 97336339EBD6E99950B8D11E083E932B VTS 1 NaN 2013-01-22 00:21:00 2013-01-22 00:37:00 1 960 6.94 ... 0.0 0.0 22.0 0 0 0.000000 1 Tuesday 2013-01-22 22

5 rows × 28 columns

In order to make sure that the sample data is fairly representative of the overall dataset, I looked at the trip count distribution for each one-hour slot.
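One simple numerical check (a sketch on toy counts, not the notebook's data) is to compare the normalized hourly shares of the sample against those of the full dataset:

```python
import pandas as pd

# Hypothetical hourly pickup counts: full data vs. a random sample
full = pd.Series([100, 300, 600], index=[0, 1, 2])
sample = pd.Series([10, 31, 59], index=[0, 1, 2])

# If the sample is representative, its per-hour shares track the population's
diff = (full / full.sum() - sample / sample.sum()).abs()
print(diff.max() < 0.02)  # True
```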

In [42]:
sample_trips_with_location['pickup_hour'].groupby(sample_trips_with_location['pickup_hour'].values).size().plot(kind='bar')
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x12e07518>
In [43]:
%%time

sample_trips_with_location.to_csv('../nyc_taxi_files/2013_Jan_samples.csv',index=False)
Wall time: 17.1 s
In [44]:
%%time

sample_trips_with_location = pd.read_csv('../nyc_taxi_files/2013_Jan_samples.csv')
Wall time: 10.4 s
In [45]:
sample_trips_with_location.head()
Out[45]:
medallion hack_license vendor_id rate_code store_and_fwd_flag pickup_datetime dropoff_datetime passenger_count trip_time_in_secs trip_distance ... tip_amount tolls_amount total_amount pickup_hour dropoff_hour tip_fraction pickup_wday pickup_day full_date month_day
0 F47E83A52C8401C874B99A98CCE246DF 4E260C3310346E735CBDF8C8E6761D66 VTS 1 NaN 2013-01-30 16:34:00 2013-01-30 16:43:00 1 540 0.86 ... 0.0 0.0 7.5 16 16 0.000000 2 Wednesday 2013-01-30 30
1 3AA58DEEC2CA4C104FCA781EC38B6DA1 8F986F92BDA7196564FAA2766EFCD3AD CMT 1 N 2013-01-16 22:31:15 2013-01-16 22:37:24 1 368 1.20 ... 0.0 0.0 7.5 22 22 0.000000 2 Wednesday 2013-01-16 16
2 BEEB829C76993A19098AFA1F5C0C88D2 1C3DFBBA24C7057BF7CB1E48E28153F0 CMT 1 N 2013-01-28 14:34:48 2013-01-28 14:47:36 1 767 1.90 ... 2.2 0.0 13.2 14 14 0.209524 0 Monday 2013-01-28 28
3 3E7AA144EBE43CE6B567524550C78DFF 4AEEA5836012EA84DF0E246D002E1A41 VTS 1 NaN 2013-01-18 18:43:00 2013-01-18 18:50:00 5 420 1.22 ... 1.6 0.0 10.1 18 18 0.228571 4 Friday 2013-01-18 18
4 A5C6F0FC150417517C2894400795B1F8 97336339EBD6E99950B8D11E083E932B VTS 1 NaN 2013-01-22 00:21:00 2013-01-22 00:37:00 1 960 6.94 ... 0.0 0.0 22.0 0 0 0.000000 1 Tuesday 2013-01-22 22

5 rows × 28 columns

In [46]:
sample_trips_with_location.shape
Out[46]:
(500000, 28)

Neighborhoods Data

We read in the neighborhood dataset available from NYC Open Data and extract the latitude and longitude of each neighborhood. We will later use this dataset to add a neighborhood column to the taxi data.

In [47]:
%%time

neighborhood_df = pd.read_csv('https://data.cityofnewyork.us/api/views/xyye-rtrs/rows.csv')
Wall time: 1min 4s
In [48]:
neighborhood_df.tail()
Out[48]:
the_geom OBJECTID Name Stacked AnnoLine1 AnnoLine2 AnnoLine3 AnnoAngle Borough
294 POINT (-74.13792663771576 40.576506293794935) 295 Lighthouse Hill 2 Lighthouse Hill NaN 0 Staten Island
295 POINT (-74.22957080626948 40.519541457489126) 296 Richmond Valley 2 Richmond Valley NaN 0 Staten Island
296 POINT (-73.8266775713865 40.790601556701525) 297 Malba 1 Malba NaN NaN 0 Queens
297 POINT (-73.8902809854761 40.68248551101694) 298 Highland Park 2 Highland Park NaN 0 Brooklyn
298 POINT (-73.948415153289 40.609377701137696) 299 Madison 1 Madison NaN NaN 0 Brooklyn

I extract the latitude and longitude information from the 'the_geom' column of the neighborhood dataset. These values are stored in two new columns.

In [49]:
%%time

neighborhood_df['longitude']=neighborhood_df['the_geom'].apply(lambda x: (x[7: -1]).split(' ')[0])
neighborhood_df['latitude']=neighborhood_df['the_geom'].apply(lambda x: (x[7: -1]).split(' ')[1])
Wall time: 120 ms
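For reference, a slightly more defensive way to parse these WKT points (a sketch; `parse_point` is a hypothetical helper, not part of the notebook): a regex that does not rely on the `'POINT ('` prefix being exactly 7 characters and tolerates extra whitespace.

```python
import re

# Matches 'POINT (lon lat)' with optional extra whitespace.
point_re = re.compile(r'POINT\s*\(\s*(-?\d+\.?\d*)\s+(-?\d+\.?\d*)\s*\)')

def parse_point(wkt):
    """Return (longitude, latitude) floats from a WKT POINT string."""
    lon, lat = point_re.match(wkt).groups()
    return float(lon), float(lat)

print(parse_point('POINT (-73.8472005205491 40.89470517661004)'))
```

Applied with `neighborhood_df['the_geom'].apply(lambda x: parse_point(x)[0])` this would produce the same longitude column as the string slicing above.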
In [50]:
# Change the datatype of the latitude and longitude columns to float

neighborhood_df['longitude'] = neighborhood_df['longitude'].astype(float)
neighborhood_df['latitude'] = neighborhood_df['latitude'].astype(float)
In [51]:
neighborhood_df.head()
Out[51]:
the_geom OBJECTID Name Stacked AnnoLine1 AnnoLine2 AnnoLine3 AnnoAngle Borough longitude latitude
0 POINT (-73.8472005205491 40.89470517661004) 1 Wakefield 1 Wakefield NaN NaN 0 Bronx -73.847201 40.894705
1 POINT (-73.82993910812405 40.87429419303015) 2 Co-op City 2 Co-op City NaN 0 Bronx -73.829939 40.874294
2 POINT (-73.82780644716419 40.88755567735082) 3 Eastchester 1 Eastchester NaN NaN 0 Bronx -73.827806 40.887556
3 POINT (-73.90564259591689 40.895437426903875) 4 Fieldston 1 Fieldston NaN NaN 0 Bronx -73.905643 40.895437
4 POINT (-73.91258546108577 40.89083449389134) 5 Riverdale 1 Riverdale NaN NaN 0 Bronx -73.912585 40.890834

Inserting neighborhood details within a sample of the data

I continue working with the sample dataset extracted from the overall dataset earlier. I now combine this sample dataset with the neighborhoods data so that each trip is assigned a neighborhood.

Seen below are the various neighborhoods from the dataset. To assign a neighborhood to each taxi trip, we would have to iterate through all the neighborhoods to find which one is closest to a given pickup point.

In [52]:
plt.scatter(x = neighborhood_df['longitude'], y = neighborhood_df['latitude'])
Out[52]:
<matplotlib.collections.PathCollection at 0x99e39860>

K-means clustering on the neighborhoods dataset

Instead of iterating through the entire neighborhood dataset while trying to assign a neighborhood to each taxi trip, we can use K-means clustering to split the dataset into 5 clusters. We can then determine the cluster to which each pickup point in the taxi trip dataset belongs, and search only within that cluster for the exact neighborhood. This way, we can reduce the processing time by almost a factor of 5.

A possible downside of this approach: a cluster centroid may lie closest to the pickup point even though the actual nearest neighborhood sits just across the border in a different cluster. For now, we ignore these border cases.
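One way to soften these border cases, sketched below on made-up coordinates (the helper `two_nearest_clusters` and all values are illustrative, not part of the notebook): rank the centroids by distance and search the neighborhoods of the two nearest clusters rather than one, at roughly twice the cost of the single-cluster search.

```python
import numpy as np

def two_nearest_clusters(point, centroids):
    """Return indices of the two centroids closest to a (lon, lat) point."""
    dists = np.linalg.norm(centroids - point, axis=1)
    return np.argsort(dists)[:2]

# Illustrative centroids, similar in shape to the k-means output below.
centroids = np.array([[-73.94, 40.62], [-73.80, 40.71], [-74.14, 40.59],
                      [-73.97, 40.73], [-73.88, 40.84]])
print(two_nearest_clusters(np.array([-73.95, 40.68]), centroids).tolist())
```

The exact-neighborhood search would then scan the union of those two clusters instead of only the single nearest one.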

In [53]:
centroids = []
cls = cluster.k_means(neighborhood_df[['longitude', 'latitude']].values, 5)
centroids.append(cls[0])
neighborhood_df['clusters'] = cls[1] # A new column is added to the neighborhoods dataset

print cls
(array([[-73.93652291,  40.62402866],
       [-73.79772289,  40.70843049],
       [-74.13890585,  40.58839034],
       [-73.96713601,  40.72986982],
       [-73.87915333,  40.84291032]]), array([4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 3, 3, 0, 0, 1, 0,
       0, 0, 0, 0, 3, 3, 3, 3, 0, 0, 3, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1,
       1, 1, 1, 3, 3, 4, 3, 3, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 4, 1, 1, 1, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 3, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3,
       2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 1, 1, 1, 4, 4, 4, 4, 4, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 0, 0, 3, 2, 2, 2, 2, 0, 0, 1, 2, 2, 2, 4, 0, 0]), 0.97057605150528592)
In [54]:
neighborhood_df.tail()
Out[54]:
the_geom OBJECTID Name Stacked AnnoLine1 AnnoLine2 AnnoLine3 AnnoAngle Borough longitude latitude clusters
294 POINT (-74.13792663771576 40.576506293794935) 295 Lighthouse Hill 2 Lighthouse Hill NaN 0 Staten Island -74.137927 40.576506 2
295 POINT (-74.22957080626948 40.519541457489126) 296 Richmond Valley 2 Richmond Valley NaN 0 Staten Island -74.229571 40.519541 2
296 POINT (-73.8266775713865 40.790601556701525) 297 Malba 1 Malba NaN NaN 0 Queens -73.826678 40.790602 4
297 POINT (-73.8902809854761 40.68248551101694) 298 Highland Park 2 Highland Park NaN 0 Brooklyn -73.890281 40.682486 0
298 POINT (-73.948415153289 40.609377701137696) 299 Madison 1 Madison NaN NaN 0 Brooklyn -73.948415 40.609378 0
In [55]:
plt.scatter(x = neighborhood_df['longitude'], y = neighborhood_df['latitude'], c=neighborhood_df['clusters'], s=15, cmap='plasma_r')
Out[55]:
<matplotlib.collections.PathCollection at 0x7c1d8400>

I wanted to check whether any neighborhoods share a name, i.e. whether more than one neighborhood has the same name.

In [56]:
neighborhood_df_sorted = neighborhood_df.sort_values(by='Name')
In [57]:
neighborhood_df_sorted.head()
Out[57]:
the_geom OBJECTID Name Stacked AnnoLine1 AnnoLine2 AnnoLine3 AnnoAngle Borough longitude latitude clusters
217 POINT (-74.17854866165885 40.538114174745104) 218 Annadale 1 Annadale NaN NaN 0 Staten Island -74.178549 40.538114 2
243 POINT (-74.185886745839 40.549285822783254) 244 Arden Heights 2 Arden Heights NaN 0 Staten Island -74.185887 40.549286 2
229 POINT (-74.16510420241131 40.63532509911496) 230 Arlington 1 Arlington NaN NaN 0 Staten Island -74.165104 40.635325 2
230 POINT (-74.06712363225581 40.596312571276776) 231 Arrochar 1 Arrochar NaN NaN 0 Staten Island -74.067124 40.596313 2
179 POINT (-73.7919923313695 40.58914394372975) 180 Arverne 1 Arverne NaN NaN 0 Queens -73.791992 40.589144 1
In [58]:
neighborhood_df_sorted.reset_index(drop=True, inplace = True)
In [59]:
neighborhood_df_sorted.head()
Out[59]:
the_geom OBJECTID Name Stacked AnnoLine1 AnnoLine2 AnnoLine3 AnnoAngle Borough longitude latitude clusters
0 POINT (-74.17854866165885 40.538114174745104) 218 Annadale 1 Annadale NaN NaN 0 Staten Island -74.178549 40.538114 2
1 POINT (-74.185886745839 40.549285822783254) 244 Arden Heights 2 Arden Heights NaN 0 Staten Island -74.185887 40.549286 2
2 POINT (-74.16510420241131 40.63532509911496) 230 Arlington 1 Arlington NaN NaN 0 Staten Island -74.165104 40.635325 2
3 POINT (-74.06712363225581 40.596312571276776) 231 Arrochar 1 Arrochar NaN NaN 0 Staten Island -74.067124 40.596313 2
4 POINT (-73.7919923313695 40.58914394372975) 180 Arverne 1 Arverne NaN NaN 0 Queens -73.791992 40.589144 1
In [60]:
neighborhood_grouped = neighborhood_df_sorted.groupby('Name').count()[['Borough']]
neighborhood_grouped[neighborhood_grouped.Borough > 1]
Out[60]:
Borough
Name
Bay Terrace 2
Chelsea 2
Murray Hill 2
Sunnyside 2

Four of the neighborhood names appear in two different boroughs. Let's take a look at them.

In [61]:
neighborhood_df[neighborhood_df.Name.isin(['Bay Terrace','Chelsea','Murray Hill','Sunnyside'])]
Out[61]:
the_geom OBJECTID Name Stacked AnnoLine1 AnnoLine2 AnnoLine3 AnnoAngle Borough longitude latitude clusters
117 POINT (-73.97833207924134 40.74830307725222) 118 Murray Hill 2 Murray Hill NaN 0 Manhattan -73.978332 40.748303 3
118 POINT (-74.0031163347282 40.74403470674802) 119 Chelsea 1 Chelsea NaN NaN 0 Manhattan -74.003116 40.744035 3
142 POINT (-73.92691617561584 40.74017628351928) 143 Sunnyside 1 Sunnyside NaN NaN 0 Queens -73.926916 40.740176 3
177 POINT (-73.77680222621588 40.7828428062456) 178 Bay Terrace 2 Bay Terrace NaN 0 Queens -73.776802 40.782843 1
182 POINT (-73.81276269135873 40.7641261226141) 183 Murray Hill 2 Murray Hill NaN 0 Queens -73.812763 40.764126 1
222 POINT (-74.09712552178537 40.612760157564935) 223 Sunnyside 1 Sunnyside NaN NaN 0 Staten Island -74.097126 40.612760 2
237 POINT (-74.13916622175775 40.55398800858466) 238 Bay Terrace 2 Bay Terrace NaN 0 Staten Island -74.139166 40.553988 2
246 POINT (-74.18956045519697 40.59472602746298) 247 Chelsea 1 Chelsea NaN NaN 0 Staten Island -74.189560 40.594726 2

As we see, Murray Hill belongs to Manhattan as well as Queens, Chelsea belongs to Manhattan as well as Staten Island, and so on. These will be handled later on after we add the neighborhoods data to the trip data.

A DataFrame with the centroid coordinates is defined below. It will be used to find the centroid nearest to a given pickup point; only the neighborhoods of that centroid's cluster are then searched when determining the pickup point's neighborhood.

In [62]:
centroids_df = pd.DataFrame(data=centroids[0], columns=('centroid_lon','centroid_lat'))
centroids_df['cluster'] = centroids_df.index.astype(int)
In [63]:
centroids_df
Out[63]:
centroid_lon centroid_lat cluster
0 -73.936523 40.624029 0
1 -73.797723 40.708430 1
2 -74.138906 40.588390 2
3 -73.967136 40.729870 3
4 -73.879153 40.842910 4

The following method takes latitude, longitude, and a cluster index as arguments. It returns the neighborhood in the specified cluster that is closest to the given latitude and longitude.

In [64]:
def search_cluster(lat,lon,cluster_ind):
    '''
    Determines the New York neighborhood that is most likely to contain a location identified from its latitude and longitude.
    Args:
        lat (float): Latitude of the required location.
        lon (float): Longitude of the required location.
        cluster_ind (int): Index of the cluster of neighborhoods within which the location lies. 
                            There are a total of 5 clusters into which all the neighborhoods have been classified.
    Returns:
        neighborhood_name (str): Name of the neighborhood that is most likely to contain the given location.
        
    '''
    cluster_df = neighborhood_df[neighborhood_df['clusters'] == cluster_ind]

    # Map each candidate neighborhood's distance to its name; the name at the
    # minimum distance is returned. Note this is a plain Euclidean distance in
    # degrees -- fine for ranking nearby points, but not actually miles.
    dist_to_name = {}

    for index2, neighborhood in cluster_df.iterrows():
        dist = np.sqrt((lat - neighborhood['latitude'])**2 + (lon - neighborhood['longitude'])**2)
        dist_to_name[dist] = neighborhood['Name']

    return dist_to_name[min(dist_to_name)]

Here, I iterate through all the rows in the sample trip dataset and determine which cluster each pickup point belongs to. I then call search_cluster to determine which neighborhood the pickup point belongs to.

The result is stored as an array of neighborhood names.

Note that this code block takes more than 1 hour to execute. If you don't want to execute this, you can move on and instead pull the new modified dataset that already contains the neighborhood data and has been saved.
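For reference, the nearest-neighborhood lookup can be made dramatically faster with a spatial index (a sketch, assuming scipy is available; the names and coordinates below are illustrative stand-ins for neighborhood_df and the pickup columns): build a KD-tree over the neighborhood centers once, then resolve all pickups in a single vectorized query instead of a Python-level loop over 500,000 rows.

```python
import numpy as np
from scipy.spatial import cKDTree

# Illustrative stand-ins for neighborhood_df['Name'] and its [lon, lat] pairs.
names = np.array(['Midtown', 'Noho', 'Clinton'])
centers = np.array([[-73.98, 40.76],
                    [-73.99, 40.73],
                    [-73.99, 40.76]])
tree = cKDTree(centers)  # built once over all neighborhood centers

# Illustrative stand-ins for the pickup [lon, lat] columns.
pickups = np.array([[-73.981, 40.759],
                    [-73.992, 40.731]])
_, idx = tree.query(pickups)  # index of the nearest center for every pickup
print(names[idx].tolist())
```

With the real data, `tree.query` over the full sample would replace both the centroid loop and search_cluster, reducing the hour-plus runtime to seconds (still using straight-line distance in degrees, like the original code).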

In [65]:
%%time


neighborhood_array=[]
for index1, taxi_row in sample_trips_with_location.iterrows():
    
    dist_min = 100   # larger than any plausible distance in degrees
    cluster_ind = 6  # sentinel; always overwritten (valid clusters are 0-4)
    
    for index2, centroid in centroids_df.iterrows():
        dist = np.sqrt((taxi_row['pickup_latitude'] - centroid['centroid_lat'])**2 + (taxi_row['pickup_longitude'] - centroid['centroid_lon'])**2)

        if dist < dist_min:
            dist_min = dist
            cluster_ind = centroid['cluster'].astype(int)
        
    nearest_nbrhd = search_cluster(taxi_row['pickup_latitude'],taxi_row['pickup_longitude'],cluster_ind)
    neighborhood_array.append(nearest_nbrhd)
Wall time: 1h 13min 8s
In [66]:
print neighborhood_array[0:5]
len(neighborhood_array)
['Midtown', 'Noho', 'Clinton', 'Midtown', 'Clinton']
Out[66]:
500000

I insert the neighborhood array into a new column of the trip sample data.

In [67]:
sample_trips_with_location['neighborhood'] = neighborhood_array
In [68]:
sample_trips_with_location.shape
Out[68]:
(500000, 29)
In [69]:
sample_trips_with_location.head()
Out[69]:
medallion hack_license vendor_id rate_code store_and_fwd_flag pickup_datetime dropoff_datetime passenger_count trip_time_in_secs trip_distance ... tolls_amount total_amount pickup_hour dropoff_hour tip_fraction pickup_wday pickup_day full_date month_day neighborhood
0 F47E83A52C8401C874B99A98CCE246DF 4E260C3310346E735CBDF8C8E6761D66 VTS 1 NaN 2013-01-30 16:34:00 2013-01-30 16:43:00 1 540 0.86 ... 0.0 7.5 16 16 0.000000 2 Wednesday 2013-01-30 30 Midtown
1 3AA58DEEC2CA4C104FCA781EC38B6DA1 8F986F92BDA7196564FAA2766EFCD3AD CMT 1 N 2013-01-16 22:31:15 2013-01-16 22:37:24 1 368 1.20 ... 0.0 7.5 22 22 0.000000 2 Wednesday 2013-01-16 16 Noho
2 BEEB829C76993A19098AFA1F5C0C88D2 1C3DFBBA24C7057BF7CB1E48E28153F0 CMT 1 N 2013-01-28 14:34:48 2013-01-28 14:47:36 1 767 1.90 ... 0.0 13.2 14 14 0.209524 0 Monday 2013-01-28 28 Clinton
3 3E7AA144EBE43CE6B567524550C78DFF 4AEEA5836012EA84DF0E246D002E1A41 VTS 1 NaN 2013-01-18 18:43:00 2013-01-18 18:50:00 5 420 1.22 ... 0.0 10.1 18 18 0.228571 4 Friday 2013-01-18 18 Midtown
4 A5C6F0FC150417517C2894400795B1F8 97336339EBD6E99950B8D11E083E932B VTS 1 NaN 2013-01-22 00:21:00 2013-01-22 00:37:00 1 960 6.94 ... 0.0 22.0 0 0 0.000000 1 Tuesday 2013-01-22 22 Clinton

5 rows × 29 columns

This dataframe with the neighborhood data is saved to a CSV and can be loaded later, so the cluster search does not have to be re-run each time we need this dataset.

In [70]:
%%time

sample_trips_with_location.to_csv('../nyc_taxi_files/2013_Jan_sample_neighborhood.csv',index=False)
Wall time: 13.4 s
In [71]:
%%time

sample_trips_with_location = pd.read_csv('../nyc_taxi_files/2013_Jan_sample_neighborhood.csv')
Wall time: 6.15 s
In [72]:
sample_trips_with_location.head()
Out[72]:
medallion hack_license vendor_id rate_code store_and_fwd_flag pickup_datetime dropoff_datetime passenger_count trip_time_in_secs trip_distance ... tolls_amount total_amount pickup_hour dropoff_hour tip_fraction pickup_wday pickup_day full_date month_day neighborhood
0 F47E83A52C8401C874B99A98CCE246DF 4E260C3310346E735CBDF8C8E6761D66 VTS 1 NaN 2013-01-30 16:34:00 2013-01-30 16:43:00 1 540 0.86 ... 0.0 7.5 16 16 0.000000 2 Wednesday 2013-01-30 30 Midtown
1 3AA58DEEC2CA4C104FCA781EC38B6DA1 8F986F92BDA7196564FAA2766EFCD3AD CMT 1 N 2013-01-16 22:31:15 2013-01-16 22:37:24 1 368 1.20 ... 0.0 7.5 22 22 0.000000 2 Wednesday 2013-01-16 16 Noho
2 BEEB829C76993A19098AFA1F5C0C88D2 1C3DFBBA24C7057BF7CB1E48E28153F0 CMT 1 N 2013-01-28 14:34:48 2013-01-28 14:47:36 1 767 1.90 ... 0.0 13.2 14 14 0.209524 0 Monday 2013-01-28 28 Clinton
3 3E7AA144EBE43CE6B567524550C78DFF 4AEEA5836012EA84DF0E246D002E1A41 VTS 1 NaN 2013-01-18 18:43:00 2013-01-18 18:50:00 5 420 1.22 ... 0.0 10.1 18 18 0.228571 4 Friday 2013-01-18 18 Midtown
4 A5C6F0FC150417517C2894400795B1F8 97336339EBD6E99950B8D11E083E932B VTS 1 NaN 2013-01-22 00:21:00 2013-01-22 00:37:00 1 960 6.94 ... 0.0 22.0 0 0 0.000000 1 Tuesday 2013-01-22 22 Clinton

5 rows × 29 columns

I now proceed to look at the number of trips on a particular day in a given neighborhood and in a 1-hour slot. For this, we need to group by ['full_date','pickup_hour','neighborhood'].

In [73]:
sample_trips_with_location.groupby(['full_date','pickup_hour','neighborhood']).count()
Out[73]:
medallion hack_license vendor_id rate_code store_and_fwd_flag pickup_datetime dropoff_datetime passenger_count trip_time_in_secs trip_distance ... surcharge mta_tax tip_amount tolls_amount total_amount dropoff_hour tip_fraction pickup_wday pickup_day month_day
full_date pickup_hour neighborhood
2013-01-01 0 Astoria 2 2 2 2 1 2 2 2 2 2 ... 2 2 2 2 2 2 2 2 2 2
Battery Park City 6 6 6 6 2 6 6 6 6 6 ... 6 6 6 6 6 6 6 6 6 6
Bushwick 1 1 1 1 0 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
Carnegie Hill 31 31 31 31 15 31 31 31 31 31 ... 31 31 31 31 31 31 31 31 31 31
Central Harlem 1 1 1 1 0 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
Chelsea 64 64 64 64 33 64 64 64 64 64 ... 64 64 64 64 64 64 64 64 64 64
Chinatown 11 11 11 11 2 11 11 11 11 11 ... 11 11 11 11 11 11 11 11 11 11
Civic Center 25 25 25 25 11 25 25 25 25 25 ... 25 25 25 25 25 25 25 25 25 25
Clinton 54 54 54 54 25 54 54 54 54 54 ... 54 54 54 54 54 54 54 54 54 54
Dumbo 1 1 1 1 0 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
East Elmhurst 2 2 2 2 0 2 2 2 2 2 ... 2 2 2 2 2 2 2 2 2 2
East Harlem 7 7 7 7 4 7 7 7 7 7 ... 7 7 7 7 7 7 7 7 7 7
East Village 37 37 37 37 17 37 37 37 37 37 ... 37 37 37 37 37 37 37 37 37 37
East Williamsburg 7 7 7 7 3 7 7 7 7 7 ... 7 7 7 7 7 7 7 7 7 7
Financial District 21 21 21 21 15 21 21 21 21 21 ... 21 21 21 21 21 21 21 21 21 21
Flatiron 55 55 55 55 26 55 55 55 55 55 ... 55 55 55 55 55 55 55 55 55 55
Gramercy 39 39 39 39 19 39 39 39 39 39 ... 39 39 39 39 39 39 39 39 39 39
Greenpoint 2 2 2 2 1 2 2 2 2 2 ... 2 2 2 2 2 2 2 2 2 2
Greenwich Village 31 31 31 31 14 31 31 31 31 31 ... 31 31 31 31 31 31 31 31 31 31
Hunters Point 2 2 2 2 2 2 2 2 2 2 ... 2 2 2 2 2 2 2 2 2 2
Lenox Hill 33 33 33 33 18 33 33 33 33 33 ... 33 33 33 33 33 33 33 33 33 33
Lincoln Square 73 73 73 73 35 73 73 73 73 73 ... 73 73 73 73 73 73 73 73 73 73
Little Italy 11 11 11 11 3 11 11 11 11 11 ... 11 11 11 11 11 11 11 11 11 11
Long Island City 4 4 4 4 1 4 4 4 4 4 ... 4 4 4 4 4 4 4 4 4 4
Lower East Side 6 6 6 6 2 6 6 6 6 6 ... 6 6 6 6 6 6 6 6 6 6
Manhattan Valley 14 14 14 14 7 14 14 14 14 14 ... 14 14 14 14 14 14 14 14 14 14
Midtown 56 56 56 56 19 56 56 56 56 56 ... 56 56 56 56 56 56 56 56 56 56
Midtown South 61 61 61 61 28 61 61 61 61 61 ... 61 61 61 61 61 61 61 61 61 61
Murray Hill 37 37 37 37 20 37 37 37 37 37 ... 37 37 37 37 37 37 37 37 37 37
Noho 35 35 35 35 14 35 35 35 35 35 ... 35 35 35 35 35 35 35 35 35 35
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2013-01-31 23 Flatiron 53 53 53 53 28 53 53 53 53 53 ... 53 53 53 53 53 53 53 53 53 53
Fulton Ferry 1 1 1 1 0 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
Gramercy 35 35 35 35 22 35 35 35 35 35 ... 35 35 35 35 35 35 35 35 35 35
Greenwich Village 47 47 47 47 27 47 47 47 47 47 ... 47 47 47 47 47 47 47 47 47 47
Hunters Point 2 2 2 2 1 2 2 2 2 2 ... 2 2 2 2 2 2 2 2 2 2
Lenox Hill 16 16 16 16 3 16 16 16 16 16 ... 16 16 16 16 16 16 16 16 16 16
Lincoln Square 93 93 93 93 51 93 93 93 93 93 ... 93 93 93 93 93 93 93 93 93 93
Little Italy 20 20 20 20 8 20 20 20 20 20 ... 20 20 20 20 20 20 20 20 20 20
Long Island City 3 3 3 3 2 3 3 3 3 3 ... 3 3 3 3 3 3 3 3 3 3
Lower East Side 2 2 2 2 0 2 2 2 2 2 ... 2 2 2 2 2 2 2 2 2 2
Manhattan Valley 13 13 13 13 7 13 13 13 13 13 ... 13 13 13 13 13 13 13 13 13 13
Midtown 120 120 120 120 52 120 120 120 120 120 ... 120 120 120 120 120 120 120 120 120 120
Midtown South 65 65 65 65 40 65 65 65 65 65 ... 65 65 65 65 65 65 65 65 65 65
Murray Hill 39 39 39 39 23 39 39 39 39 39 ... 39 39 39 39 39 39 39 39 39 39
Noho 43 43 43 43 23 43 43 43 43 43 ... 43 43 43 43 43 43 43 43 43 43
North Side 14 14 14 14 6 14 14 14 14 14 ... 14 14 14 14 14 14 14 14 14 14
Ravenswood 1 1 1 1 0 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
Roosevelt Island 3 3 3 3 1 3 3 3 3 3 ... 3 3 3 3 3 3 3 3 3 3
Soho 20 20 20 20 11 20 20 20 20 20 ... 20 20 20 20 20 20 20 20 20 20
South Corona 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
South Side 6 6 6 6 3 6 6 6 6 6 ... 6 6 6 6 6 6 6 6 6 6
Sunnyside Gardens 2 2 2 2 2 2 2 2 2 2 ... 2 2 2 2 2 2 2 2 2 2
Sutton Place 53 53 53 53 24 53 53 53 53 53 ... 53 53 53 53 53 53 53 53 53 53
Tribeca 13 13 13 13 6 13 13 13 13 13 ... 13 13 13 13 13 13 13 13 13 13
Tudor City 9 9 9 9 4 9 9 9 9 9 ... 9 9 9 9 9 9 9 9 9 9
Turtle Bay 49 49 49 49 23 49 49 49 49 49 ... 49 49 49 49 49 49 49 49 49 49
Upper East Side 13 13 13 13 6 13 13 13 13 13 ... 13 13 13 13 13 13 13 13 13 13
Upper West Side 29 29 29 29 15 29 29 29 29 29 ... 29 29 29 29 29 29 29 29 29 29
West Village 28 28 28 28 13 28 28 28 28 28 ... 28 28 28 28 28 28 28 28 28 28
Yorkville 10 10 10 10 4 10 10 10 10 10 ... 10 10 10 10 10 10 10 10 10 10

29553 rows × 26 columns

I define a new dataframe that contains the date, pickup hour, pickup day, neighborhood, number of trips, and month day.

In [74]:
trip_count_df = pd.DataFrame(columns = ['full_date','pickup_hour','pickup_day','neighborhood','month_day','no_of_trips'])
In [75]:
%%time

# All columns except the trip count are filled with the .first() value of that column within each group
trip_count_df['full_date'] = sample_trips_with_location.groupby(['full_date','pickup_hour','neighborhood'])['full_date'].first().values
trip_count_df['pickup_hour'] = sample_trips_with_location.groupby(['full_date','pickup_hour','neighborhood'])['pickup_hour'].first().values
trip_count_df['neighborhood'] = sample_trips_with_location.groupby(['full_date','pickup_hour','neighborhood'])['neighborhood'].first().values
trip_count_df['pickup_day'] = sample_trips_with_location.groupby(['full_date','pickup_hour','neighborhood'])['pickup_day'].first().values
trip_count_df['month_day'] = sample_trips_with_location.groupby(['full_date','pickup_hour','neighborhood'])['month_day'].first().values

# The number of trips is determined using the .count() method
trip_count_df['no_of_trips'] = sample_trips_with_location.groupby(['full_date','pickup_hour','neighborhood'])['neighborhood'].count().values
Wall time: 1.25 s
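The same table can also be built in one step with groupby(...).size(), which avoids repeating the groupby for every column. A sketch on toy data with the same column names:

```python
import pandas as pd

# Toy stand-in for sample_trips_with_location: size() counts rows per group,
# and reset_index recovers the group keys as ordinary columns.
trips = pd.DataFrame({
    'full_date':    ['2013-01-01'] * 3 + ['2013-01-02'],
    'pickup_hour':  [0, 0, 1, 0],
    'neighborhood': ['Astoria', 'Astoria', 'Astoria', 'Midtown'],
})
counts = (trips.groupby(['full_date', 'pickup_hour', 'neighborhood'])
               .size()
               .reset_index(name='no_of_trips'))
print(counts)
```

Columns that are constant within a group (like pickup_day and month_day, which are derived from full_date) could then be merged back in a single join.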
In [76]:
trip_count_df.head()
Out[76]:
full_date pickup_hour pickup_day neighborhood month_day no_of_trips
0 2013-01-01 0 Tuesday Astoria 1 2
1 2013-01-01 0 Tuesday Battery Park City 1 6
2 2013-01-01 0 Tuesday Bushwick 1 1
3 2013-01-01 0 Tuesday Carnegie Hill 1 31
4 2013-01-01 0 Tuesday Central Harlem 1 1
In [77]:
trip_count_df.shape
Out[77]:
(29553, 6)
In [78]:
print trip_count_df['neighborhood'].unique().size
print neighborhood_df['Name'].unique().size
82
295

I remove the rows corresponding to the 'Bay Terrace', 'Chelsea', 'Murray Hill', and 'Sunnyside' neighborhoods, since each of these names appears in more than one borough.

In [79]:
%%time

trip_count_df = trip_count_df[~(trip_count_df.neighborhood.isin(['Bay Terrace','Chelsea','Murray Hill','Sunnyside']))]
Wall time: 82 ms
In [80]:
trip_count_df.shape
Out[80]:
(27790, 6)

To include neighborhood details such as latitude, longitude, and borough, I join the trip_count_df dataframe with the neighborhood_df dataframe.

In [91]:
%%time

trip_count_with_neighborhood = trip_count_df.set_index('neighborhood').join(neighborhood_df.set_index('Name'))
Wall time: 31 ms
In [82]:
trip_count_with_neighborhood.shape
Out[82]:
(27790, 16)
In [83]:
trip_count_with_neighborhood.head()
Out[83]:
full_date pickup_hour pickup_day month_day no_of_trips the_geom OBJECTID Stacked AnnoLine1 AnnoLine2 AnnoLine3 AnnoAngle Borough longitude latitude clusters
Astoria 2013-01-01 0 Tuesday 1 2 POINT (-73.91565374304241 40.768508593354966) 132 1 Astoria NaN NaN 0 Queens -73.915654 40.768509 3
Astoria 2013-01-01 1 Tuesday 1 6 POINT (-73.91565374304241 40.768508593354966) 132 1 Astoria NaN NaN 0 Queens -73.915654 40.768509 3
Astoria 2013-01-01 2 Tuesday 1 7 POINT (-73.91565374304241 40.768508593354966) 132 1 Astoria NaN NaN 0 Queens -73.915654 40.768509 3
Astoria 2013-01-01 3 Tuesday 1 10 POINT (-73.91565374304241 40.768508593354966) 132 1 Astoria NaN NaN 0 Queens -73.915654 40.768509 3
Astoria 2013-01-01 4 Tuesday 1 7 POINT (-73.91565374304241 40.768508593354966) 132 1 Astoria NaN NaN 0 Queens -73.915654 40.768509 3

I want to retain only the important columns of this dataset. These columns are extracted as shown below.

In [92]:
%%time

trip_count_with_neighborhood = trip_count_with_neighborhood[['full_date','pickup_hour','pickup_day','month_day','Borough','longitude','latitude','no_of_trips']]
trip_count_with_neighborhood['neighborhood'] = trip_count_with_neighborhood.index.values
trip_count_with_neighborhood.reset_index(drop=True,inplace=True)
Wall time: 156 ms
In [93]:
trip_count_with_neighborhood.head()
Out[93]:
full_date pickup_hour pickup_day month_day Borough longitude latitude no_of_trips neighborhood
0 2013-01-01 0 Tuesday 1 Queens -73.915654 40.768509 2 Astoria
1 2013-01-01 1 Tuesday 1 Queens -73.915654 40.768509 6 Astoria
2 2013-01-01 2 Tuesday 1 Queens -73.915654 40.768509 7 Astoria
3 2013-01-01 3 Tuesday 1 Queens -73.915654 40.768509 10 Astoria
4 2013-01-01 4 Tuesday 1 Queens -73.915654 40.768509 7 Astoria
In [94]:
trip_count_with_neighborhood.shape
Out[94]:
(27790, 9)
In [95]:
print trip_count_with_neighborhood.Borough.unique()
print trip_count_with_neighborhood.neighborhood.unique().size
['Queens' 'Manhattan' 'Brooklyn' 'Bronx']
79
In [96]:
%%time

trip_count_with_neighborhood.to_csv('../nyc_taxi_files/2013_Jan_sample_tripcount.csv',index = False)
Wall time: 229 ms

Linear Regression models

I use statsmodels and scikit-learn to build a predictive model for the number of taxi trips in a given neighborhood in a given 1-hour time period.

In [97]:
%%time

trip_count_with_neighborhood = pd.read_csv('../nyc_taxi_files/2013_Jan_sample_tripcount.csv')
Wall time: 65 ms
In [98]:
trip_count_with_neighborhood.head()
Out[98]:
full_date pickup_hour pickup_day month_day Borough longitude latitude no_of_trips neighborhood
0 2013-01-01 0 Tuesday 1 Queens -73.915654 40.768509 2 Astoria
1 2013-01-01 1 Tuesday 1 Queens -73.915654 40.768509 6 Astoria
2 2013-01-01 2 Tuesday 1 Queens -73.915654 40.768509 7 Astoria
3 2013-01-01 3 Tuesday 1 Queens -73.915654 40.768509 10 Astoria
4 2013-01-01 4 Tuesday 1 Queens -73.915654 40.768509 7 Astoria

I progressively add predictors to an ordinary least squares (OLS) linear model and observe the R-squared value. Subsequently, when I fit a scikit-learn model to the same data, I will look at the mean squared error.
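For reference, the scikit-learn counterpart of a categorical OLS fit looks roughly like the sketch below (toy data; the real fit on trip_count_with_neighborhood comes later). pd.get_dummies plays the role of the formula API's T.* treatment coding, and mean_squared_error is the metric mentioned above.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Toy stand-in for trip_count_with_neighborhood.
df = pd.DataFrame({
    'neighborhood': ['Midtown', 'Midtown', 'Astoria', 'Astoria'],
    'no_of_trips':  [60, 70, 2, 4],
})
X = pd.get_dummies(df['neighborhood'])  # one-hot encoding of the categorical predictor
y = df['no_of_trips']

model = LinearRegression().fit(X, y)
mse = mean_squared_error(y, model.predict(X))  # here: in-sample MSE
print(mse)
```

With a single categorical predictor, the fitted values are just the per-neighborhood mean trip counts, so this MSE measures the within-neighborhood variance the model cannot explain.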

In [99]:
lm = smf.ols(formula='no_of_trips ~ neighborhood', data=trip_count_with_neighborhood).fit() #using neighborhood as the sole predictor
lm.summary()
Out[99]:
OLS Regression Results
Dep. Variable: no_of_trips R-squared: 0.577
Model: OLS Adj. R-squared: 0.576
Method: Least Squares F-statistic: 485.5
Date: Wed, 15 Jun 2016 Prob (F-statistic): 0.00
Time: 00:02:41 Log-Likelihood: -1.1166e+05
No. Observations: 27790 AIC: 2.235e+05
Df Residuals: 27711 BIC: 2.241e+05
Df Model: 78
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 2.0642 0.585 3.528 0.000 0.917 3.211
neighborhood[T.Astoria Heights] -0.1169 0.935 -0.125 0.900 -1.950 1.716
neighborhood[T.Battery Park City] 5.1889 0.787 6.597 0.000 3.647 6.731
neighborhood[T.Blissville] -1.0219 1.702 -0.600 0.548 -4.358 2.314
neighborhood[T.Briarwood] -0.9808 2.320 -0.423 0.672 -5.527 3.566
neighborhood[T.Bushwick] -0.8536 2.262 -0.377 0.706 -5.287 3.580
neighborhood[T.Carnegie Hill] 22.0913 0.769 28.718 0.000 20.584 23.599
neighborhood[T.Central Harlem] -0.9260 1.239 -0.747 0.455 -3.355 1.503
neighborhood[T.Chinatown] 1.4760 0.792 1.863 0.063 -0.077 3.029
neighborhood[T.Civic Center] 6.7012 0.772 8.683 0.000 5.189 8.214
neighborhood[T.Clason Point] -1.0642 9.541 -0.112 0.911 -19.765 17.637
neighborhood[T.Clinton] 29.7893 0.766 38.913 0.000 28.289 31.290
neighborhood[T.College Point] -1.0642 5.529 -0.192 0.847 -11.902 9.774
neighborhood[T.Downtown Flushing] -0.8917 2.568 -0.347 0.728 -5.926 4.143
neighborhood[T.Dumbo] -0.5979 0.962 -0.622 0.534 -2.484 1.288
neighborhood[T.East Elmhurst] 13.4261 0.798 16.826 0.000 11.862 14.990
neighborhood[T.East Harlem] 1.6771 0.779 2.152 0.031 0.149 3.205
neighborhood[T.East Village] 13.3719 0.766 17.463 0.000 11.871 14.873
neighborhood[T.East Williamsburg] 0.1637 1.052 0.156 0.876 -1.899 2.226
neighborhood[T.Elmhurst] -0.8957 1.501 -0.597 0.551 -3.837 2.045
neighborhood[T.Financial District] 8.5488 0.774 11.051 0.000 7.033 10.065
neighborhood[T.Flatiron] 39.4419 0.766 51.508 0.000 37.941 40.943
neighborhood[T.Forest Hills] -0.8987 1.306 -0.688 0.491 -3.459 1.661
neighborhood[T.Forest Hills Gardens] -0.9996 2.489 -0.402 0.688 -5.878 3.878
neighborhood[T.Fulton Ferry] -0.9920 1.487 -0.667 0.505 -3.907 1.923
neighborhood[T.Glendale] -1.0642 6.759 -0.157 0.875 -14.313 12.185
neighborhood[T.Gramercy] 17.9710 0.767 23.442 0.000 16.468 19.474
neighborhood[T.Greenpoint] -0.0869 0.965 -0.090 0.928 -1.978 1.805
neighborhood[T.Greenwich Village] 20.3469 0.766 26.564 0.000 18.846 21.848
neighborhood[T.Hillcrest] -1.0642 13.481 -0.079 0.937 -27.487 25.359
neighborhood[T.Hunters Point] -0.6407 0.966 -0.663 0.507 -2.534 1.253
neighborhood[T.Hunts Point] -1.0642 7.798 -0.136 0.891 -16.348 14.220
neighborhood[T.Jackson Heights] -0.9594 1.439 -0.667 0.505 -3.779 1.860
neighborhood[T.Jamaica Center] -1.0642 13.481 -0.079 0.937 -27.487 25.359
neighborhood[T.Jamaica Hills] -1.0642 9.541 -0.112 0.911 -19.765 17.637
neighborhood[T.Kew Gardens] -0.9872 2.705 -0.365 0.715 -6.290 4.315
neighborhood[T.Kew Gardens Hills] -1.0642 3.145 -0.338 0.735 -7.228 5.100
neighborhood[T.Lefrak City] -0.9742 1.468 -0.663 0.507 -3.852 1.904
neighborhood[T.Lenox Hill] 24.1752 0.768 31.464 0.000 22.669 25.681
neighborhood[T.Lincoln Square] 43.5138 0.766 56.841 0.000 42.013 45.014
neighborhood[T.Little Italy] 6.0846 0.774 7.861 0.000 4.567 7.602
neighborhood[T.Long Island City] -0.4201 0.852 -0.493 0.622 -2.091 1.251
neighborhood[T.Lower East Side] 0.7812 0.813 0.961 0.337 -0.812 2.375
neighborhood[T.Malba] -1.0642 9.541 -0.112 0.911 -19.765 17.637
neighborhood[T.Manhattan Valley] 7.4000 0.773 9.574 0.000 5.885 8.915
neighborhood[T.Maspeth] -1.0297 2.568 -0.401 0.689 -6.064 4.005
neighborhood[T.Middle Village] -1.0642 4.527 -0.235 0.814 -9.938 7.810
neighborhood[T.Midtown] 65.4486 0.766 85.470 0.000 63.948 66.950
neighborhood[T.Midtown South] 43.6186 0.766 56.978 0.000 42.118 45.119
neighborhood[T.Mott Haven] -0.9272 1.681 -0.551 0.581 -4.223 2.368
neighborhood[T.Navy Yard] -1.0642 7.798 -0.136 0.891 -16.348 14.220
neighborhood[T.Noho] 19.2557 0.766 25.153 0.000 17.755 20.756
neighborhood[T.North Corona] -1.0129 2.235 -0.453 0.650 -5.393 3.367
neighborhood[T.North Side] 3.1334 0.846 3.704 0.000 1.475 4.791
neighborhood[T.Pomonok] -1.0642 7.798 -0.136 0.891 -16.348 14.220
neighborhood[T.Port Morris] -1.0642 2.657 -0.400 0.689 -6.272 4.144
neighborhood[T.Queensboro Hill] -1.0642 3.318 -0.321 0.748 -7.568 5.440
neighborhood[T.Ravenswood] -0.0516 0.817 -0.063 0.950 -1.652 1.549
neighborhood[T.Rego Park] -1.0527 1.558 -0.676 0.499 -4.106 2.001
neighborhood[T.Ridgewood] -1.0642 3.318 -0.321 0.748 -7.568 5.440
neighborhood[T.Roosevelt Island] 0.3517 0.838 0.420 0.675 -1.290 1.993
neighborhood[T.Soho] 5.7913 0.777 7.450 0.000 4.268 7.315
neighborhood[T.South Corona] -1.0187 2.930 -0.348 0.728 -6.762 4.725
neighborhood[T.South Side] 0.2077 0.885 0.235 0.814 -1.526 1.942
neighborhood[T.Steinway] -1.0642 2.657 -0.400 0.689 -6.272 4.144
neighborhood[T.Stuyvesant Town] 0.3141 0.823 0.382 0.703 -1.299 1.927
neighborhood[T.Sunnyside Gardens] -0.3059 0.879 -0.348 0.728 -2.028 1.416
neighborhood[T.Sutton Place] 41.2252 0.766 53.837 0.000 39.724 42.726
neighborhood[T.Tribeca] 5.5640 0.780 7.137 0.000 4.036 7.092
neighborhood[T.Tudor City] 6.9213 0.778 8.892 0.000 5.396 8.447
neighborhood[T.Turtle Bay] 20.1310 0.767 26.252 0.000 18.628 21.634
neighborhood[T.Upper East Side] 23.0066 0.774 29.731 0.000 21.490 24.523
neighborhood[T.Upper West Side] 23.0923 0.767 30.089 0.000 21.588 24.597
neighborhood[T.Vinegar Hill] -0.7427 2.612 -0.284 0.776 -5.862 4.376
neighborhood[T.West Village] 17.1480 0.766 22.375 0.000 15.646 18.650
neighborhood[T.Whitestone] -1.0642 13.481 -0.079 0.937 -27.487 25.359
neighborhood[T.Williamsburg] -0.7874 1.401 -0.562 0.574 -3.533 1.958
neighborhood[T.Woodside] -0.7110 0.953 -0.746 0.456 -2.580 1.158
neighborhood[T.Yorkville] 12.6567 0.771 16.420 0.000 11.146 14.168
Omnibus: 3844.398 Durbin-Watson: 0.353
Prob(Omnibus): 0.000 Jarque-Bera (JB): 40287.892
Skew: 0.304 Prob(JB): 0.00
Kurtosis: 8.867 Cond. No. 169.

We observe that the R-squared value is reasonable, given that the model uses only the neighborhood as a predictor.

Some neighborhoods have lower p-values than others. To investigate, I compare two neighborhoods with low p-values against two with high p-values.

In [100]:
trip_count_with_neighborhood[trip_count_with_neighborhood.neighborhood == 'Battery Park City'] #p-value = 0
Out[100]:
full_date pickup_hour pickup_day month_day Borough longitude latitude no_of_trips neighborhood
871 2013-01-01 0 Tuesday 1 Manhattan -74.016869 40.711932 6 Battery Park City
872 2013-01-01 1 Tuesday 1 Manhattan -74.016869 40.711932 4 Battery Park City
873 2013-01-01 2 Tuesday 1 Manhattan -74.016869 40.711932 7 Battery Park City
874 2013-01-01 3 Tuesday 1 Manhattan -74.016869 40.711932 5 Battery Park City
875 2013-01-01 4 Tuesday 1 Manhattan -74.016869 40.711932 4 Battery Park City
876 2013-01-01 5 Tuesday 1 Manhattan -74.016869 40.711932 4 Battery Park City
877 2013-01-01 6 Tuesday 1 Manhattan -74.016869 40.711932 1 Battery Park City
878 2013-01-01 7 Tuesday 1 Manhattan -74.016869 40.711932 2 Battery Park City
879 2013-01-01 8 Tuesday 1 Manhattan -74.016869 40.711932 2 Battery Park City
880 2013-01-01 9 Tuesday 1 Manhattan -74.016869 40.711932 3 Battery Park City
881 2013-01-01 10 Tuesday 1 Manhattan -74.016869 40.711932 5 Battery Park City
882 2013-01-01 11 Tuesday 1 Manhattan -74.016869 40.711932 6 Battery Park City
883 2013-01-01 12 Tuesday 1 Manhattan -74.016869 40.711932 9 Battery Park City
884 2013-01-01 13 Tuesday 1 Manhattan -74.016869 40.711932 8 Battery Park City
885 2013-01-01 14 Tuesday 1 Manhattan -74.016869 40.711932 9 Battery Park City
886 2013-01-01 15 Tuesday 1 Manhattan -74.016869 40.711932 8 Battery Park City
887 2013-01-01 16 Tuesday 1 Manhattan -74.016869 40.711932 5 Battery Park City
888 2013-01-01 17 Tuesday 1 Manhattan -74.016869 40.711932 4 Battery Park City
889 2013-01-01 18 Tuesday 1 Manhattan -74.016869 40.711932 5 Battery Park City
890 2013-01-01 19 Tuesday 1 Manhattan -74.016869 40.711932 1 Battery Park City
891 2013-01-01 20 Tuesday 1 Manhattan -74.016869 40.711932 3 Battery Park City
892 2013-01-01 21 Tuesday 1 Manhattan -74.016869 40.711932 2 Battery Park City
893 2013-01-01 22 Tuesday 1 Manhattan -74.016869 40.711932 3 Battery Park City
894 2013-01-01 23 Tuesday 1 Manhattan -74.016869 40.711932 1 Battery Park City
895 2013-01-02 0 Wednesday 2 Manhattan -74.016869 40.711932 2 Battery Park City
896 2013-01-02 1 Wednesday 2 Manhattan -74.016869 40.711932 1 Battery Park City
897 2013-01-02 2 Wednesday 2 Manhattan -74.016869 40.711932 1 Battery Park City
898 2013-01-02 4 Wednesday 2 Manhattan -74.016869 40.711932 1 Battery Park City
899 2013-01-02 5 Wednesday 2 Manhattan -74.016869 40.711932 1 Battery Park City
900 2013-01-02 6 Wednesday 2 Manhattan -74.016869 40.711932 5 Battery Park City
... ... ... ... ... ... ... ... ... ...
1497 2013-01-30 14 Wednesday 30 Manhattan -74.016869 40.711932 7 Battery Park City
1498 2013-01-30 15 Wednesday 30 Manhattan -74.016869 40.711932 7 Battery Park City
1499 2013-01-30 16 Wednesday 30 Manhattan -74.016869 40.711932 5 Battery Park City
1500 2013-01-30 17 Wednesday 30 Manhattan -74.016869 40.711932 8 Battery Park City
1501 2013-01-30 18 Wednesday 30 Manhattan -74.016869 40.711932 8 Battery Park City
1502 2013-01-30 19 Wednesday 30 Manhattan -74.016869 40.711932 12 Battery Park City
1503 2013-01-30 20 Wednesday 30 Manhattan -74.016869 40.711932 9 Battery Park City
1504 2013-01-30 21 Wednesday 30 Manhattan -74.016869 40.711932 9 Battery Park City
1505 2013-01-30 22 Wednesday 30 Manhattan -74.016869 40.711932 3 Battery Park City
1506 2013-01-30 23 Wednesday 30 Manhattan -74.016869 40.711932 4 Battery Park City
1507 2013-01-31 0 Thursday 31 Manhattan -74.016869 40.711932 1 Battery Park City
1508 2013-01-31 5 Thursday 31 Manhattan -74.016869 40.711932 4 Battery Park City
1509 2013-01-31 6 Thursday 31 Manhattan -74.016869 40.711932 9 Battery Park City
1510 2013-01-31 7 Thursday 31 Manhattan -74.016869 40.711932 15 Battery Park City
1511 2013-01-31 8 Thursday 31 Manhattan -74.016869 40.711932 21 Battery Park City
1512 2013-01-31 9 Thursday 31 Manhattan -74.016869 40.711932 18 Battery Park City
1513 2013-01-31 10 Thursday 31 Manhattan -74.016869 40.711932 13 Battery Park City
1514 2013-01-31 11 Thursday 31 Manhattan -74.016869 40.711932 10 Battery Park City
1515 2013-01-31 12 Thursday 31 Manhattan -74.016869 40.711932 12 Battery Park City
1516 2013-01-31 13 Thursday 31 Manhattan -74.016869 40.711932 9 Battery Park City
1517 2013-01-31 14 Thursday 31 Manhattan -74.016869 40.711932 13 Battery Park City
1518 2013-01-31 15 Thursday 31 Manhattan -74.016869 40.711932 11 Battery Park City
1519 2013-01-31 16 Thursday 31 Manhattan -74.016869 40.711932 6 Battery Park City
1520 2013-01-31 17 Thursday 31 Manhattan -74.016869 40.711932 6 Battery Park City
1521 2013-01-31 18 Thursday 31 Manhattan -74.016869 40.711932 7 Battery Park City
1522 2013-01-31 19 Thursday 31 Manhattan -74.016869 40.711932 7 Battery Park City
1523 2013-01-31 20 Thursday 31 Manhattan -74.016869 40.711932 16 Battery Park City
1524 2013-01-31 21 Thursday 31 Manhattan -74.016869 40.711932 11 Battery Park City
1525 2013-01-31 22 Thursday 31 Manhattan -74.016869 40.711932 5 Battery Park City
1526 2013-01-31 23 Thursday 31 Manhattan -74.016869 40.711932 3 Battery Park City

656 rows × 9 columns

In [101]:
trip_count_with_neighborhood[trip_count_with_neighborhood.neighborhood == 'Astoria Heights'] #p-value = 0.9
Out[101]:
full_date pickup_hour pickup_day month_day Borough longitude latitude no_of_trips neighborhood
530 2013-01-01 4 Tuesday 1 Queens -73.89468 40.770317 1 Astoria Heights
531 2013-01-01 10 Tuesday 1 Queens -73.89468 40.770317 1 Astoria Heights
532 2013-01-01 12 Tuesday 1 Queens -73.89468 40.770317 1 Astoria Heights
533 2013-01-01 14 Tuesday 1 Queens -73.89468 40.770317 2 Astoria Heights
534 2013-01-01 16 Tuesday 1 Queens -73.89468 40.770317 2 Astoria Heights
535 2013-01-01 19 Tuesday 1 Queens -73.89468 40.770317 1 Astoria Heights
536 2013-01-01 22 Tuesday 1 Queens -73.89468 40.770317 1 Astoria Heights
537 2013-01-02 9 Wednesday 2 Queens -73.89468 40.770317 5 Astoria Heights
538 2013-01-02 10 Wednesday 2 Queens -73.89468 40.770317 1 Astoria Heights
539 2013-01-02 11 Wednesday 2 Queens -73.89468 40.770317 1 Astoria Heights
540 2013-01-02 13 Wednesday 2 Queens -73.89468 40.770317 1 Astoria Heights
541 2013-01-02 14 Wednesday 2 Queens -73.89468 40.770317 3 Astoria Heights
542 2013-01-02 16 Wednesday 2 Queens -73.89468 40.770317 1 Astoria Heights
543 2013-01-02 18 Wednesday 2 Queens -73.89468 40.770317 2 Astoria Heights
544 2013-01-02 20 Wednesday 2 Queens -73.89468 40.770317 2 Astoria Heights
545 2013-01-02 22 Wednesday 2 Queens -73.89468 40.770317 1 Astoria Heights
546 2013-01-03 7 Thursday 3 Queens -73.89468 40.770317 1 Astoria Heights
547 2013-01-03 8 Thursday 3 Queens -73.89468 40.770317 5 Astoria Heights
548 2013-01-03 9 Thursday 3 Queens -73.89468 40.770317 3 Astoria Heights
549 2013-01-03 11 Thursday 3 Queens -73.89468 40.770317 2 Astoria Heights
550 2013-01-03 13 Thursday 3 Queens -73.89468 40.770317 2 Astoria Heights
551 2013-01-03 14 Thursday 3 Queens -73.89468 40.770317 3 Astoria Heights
552 2013-01-03 15 Thursday 3 Queens -73.89468 40.770317 1 Astoria Heights
553 2013-01-03 16 Thursday 3 Queens -73.89468 40.770317 1 Astoria Heights
554 2013-01-03 17 Thursday 3 Queens -73.89468 40.770317 1 Astoria Heights
555 2013-01-03 18 Thursday 3 Queens -73.89468 40.770317 2 Astoria Heights
556 2013-01-03 20 Thursday 3 Queens -73.89468 40.770317 2 Astoria Heights
557 2013-01-04 7 Friday 4 Queens -73.89468 40.770317 1 Astoria Heights
558 2013-01-04 8 Friday 4 Queens -73.89468 40.770317 2 Astoria Heights
559 2013-01-04 10 Friday 4 Queens -73.89468 40.770317 1 Astoria Heights
... ... ... ... ... ... ... ... ... ...
841 2013-01-29 23 Tuesday 29 Queens -73.89468 40.770317 1 Astoria Heights
842 2013-01-30 7 Wednesday 30 Queens -73.89468 40.770317 2 Astoria Heights
843 2013-01-30 9 Wednesday 30 Queens -73.89468 40.770317 1 Astoria Heights
844 2013-01-30 10 Wednesday 30 Queens -73.89468 40.770317 1 Astoria Heights
845 2013-01-30 12 Wednesday 30 Queens -73.89468 40.770317 2 Astoria Heights
846 2013-01-30 13 Wednesday 30 Queens -73.89468 40.770317 3 Astoria Heights
847 2013-01-30 14 Wednesday 30 Queens -73.89468 40.770317 1 Astoria Heights
848 2013-01-30 16 Wednesday 30 Queens -73.89468 40.770317 2 Astoria Heights
849 2013-01-30 17 Wednesday 30 Queens -73.89468 40.770317 3 Astoria Heights
850 2013-01-30 18 Wednesday 30 Queens -73.89468 40.770317 1 Astoria Heights
851 2013-01-30 19 Wednesday 30 Queens -73.89468 40.770317 2 Astoria Heights
852 2013-01-30 20 Wednesday 30 Queens -73.89468 40.770317 3 Astoria Heights
853 2013-01-30 22 Wednesday 30 Queens -73.89468 40.770317 2 Astoria Heights
854 2013-01-30 23 Wednesday 30 Queens -73.89468 40.770317 1 Astoria Heights
855 2013-01-31 0 Thursday 31 Queens -73.89468 40.770317 1 Astoria Heights
856 2013-01-31 4 Thursday 31 Queens -73.89468 40.770317 1 Astoria Heights
857 2013-01-31 6 Thursday 31 Queens -73.89468 40.770317 3 Astoria Heights
858 2013-01-31 9 Thursday 31 Queens -73.89468 40.770317 3 Astoria Heights
859 2013-01-31 10 Thursday 31 Queens -73.89468 40.770317 1 Astoria Heights
860 2013-01-31 11 Thursday 31 Queens -73.89468 40.770317 2 Astoria Heights
861 2013-01-31 13 Thursday 31 Queens -73.89468 40.770317 2 Astoria Heights
862 2013-01-31 14 Thursday 31 Queens -73.89468 40.770317 2 Astoria Heights
863 2013-01-31 15 Thursday 31 Queens -73.89468 40.770317 3 Astoria Heights
864 2013-01-31 17 Thursday 31 Queens -73.89468 40.770317 2 Astoria Heights
865 2013-01-31 18 Thursday 31 Queens -73.89468 40.770317 2 Astoria Heights
866 2013-01-31 19 Thursday 31 Queens -73.89468 40.770317 1 Astoria Heights
867 2013-01-31 20 Thursday 31 Queens -73.89468 40.770317 7 Astoria Heights
868 2013-01-31 21 Thursday 31 Queens -73.89468 40.770317 5 Astoria Heights
869 2013-01-31 22 Thursday 31 Queens -73.89468 40.770317 1 Astoria Heights
870 2013-01-31 23 Thursday 31 Queens -73.89468 40.770317 2 Astoria Heights

341 rows × 9 columns

In [102]:
print trip_count_with_neighborhood[trip_count_with_neighborhood.neighborhood == 'Carnegie Hill'].shape #p-value = 0
print trip_count_with_neighborhood[trip_count_with_neighborhood.neighborhood == 'Bushwick'].shape #p-value = 0.706
(727, 9)
(38, 9)

It appears that neighborhoods with fewer entries tend to have higher p-values, which makes sense: fewer observations mean larger standard errors for the corresponding coefficients.
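
Rather than checking neighborhoods one at a time, the entry counts can be tabulated in a single groupby. This is a sketch on a mocked-up miniature frame; in the notebook it would be run on the full `trip_count_with_neighborhood` dataframe.

```python
import pandas as pd

# Hypothetical miniature of trip_count_with_neighborhood; in the real
# notebook this would be the full dataframe built earlier.
df = pd.DataFrame({
    'neighborhood': ['Carnegie Hill'] * 5 + ['Bushwick'] * 2,
    'no_of_trips': [9, 12, 7, 11, 8, 1, 2],
})

# Entries per neighborhood, sorted so sparse neighborhoods surface first
entry_counts = df.groupby('neighborhood').size().sort_values()
print(entry_counts)
```

Sorting ascending puts the sparsely observed neighborhoods (the likely high-p-value ones) at the top of the listing.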

Next, I create dummy variables for the pickup hour. These dummy variables will be additional predictors for the trip count.

In [103]:
hour = pd.get_dummies(trip_count_with_neighborhood.pickup_hour, prefix = 'hour')
trip_count_with_neighborhood = trip_count_with_neighborhood.join(hour)  # joins hour_0 through hour_23
In [105]:
sns.heatmap(trip_count_with_neighborhood.drop(['full_date','pickup_hour'], axis=1).corr())
Out[105]:
<matplotlib.axes._subplots.AxesSubplot at 0x2d8246b70>

We notice a correlation between latitude, longitude, and the number of trips; there is also a weak correlation between the dummy hour variables and the number of trips.
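
Instead of reading the heatmap by eye, the correlations with the target can be ranked directly. A sketch on synthetic stand-in columns (not the notebook's dataframe):

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for trip_count_with_neighborhood after joining the
# hour dummies; only a few numeric columns are mocked up here.
rng = np.random.RandomState(0)
n = 200
df = pd.DataFrame({
    'latitude': rng.uniform(40.6, 40.9, n),
    'hour_8': (rng.rand(n) < 0.05).astype(float),
})
df['no_of_trips'] = 100 * (df['latitude'] - 40.6) + 5 * df['hour_8'] + rng.randn(n)

# Correlation of every numeric column with the trip count, strongest first
corr_with_trips = df.corr()['no_of_trips'].drop('no_of_trips').sort_values(ascending=False)
print(corr_with_trips)
```

On the real data this would surface the same latitude/longitude and hour relationships that the heatmap shows, in a form that is easier to sort and threshold.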

I now perform an OLS fit using the neighborhood and pickup hour as predictors. The R-squared value increases slightly.

In [106]:
lm = smf.ols(formula='no_of_trips ~ neighborhood + hour_0 + hour_1 + hour_2 + hour_3 + hour_4 + hour_5 + \
             hour_6 + hour_7 + hour_8 + hour_9 + hour_10 + hour_11 + hour_12 + hour_13 + hour_14 + hour_15 + \
             hour_16 + hour_17 + hour_18 + hour_19 + hour_20 + hour_21 + hour_22 + hour_23', data=trip_count_with_neighborhood).fit()
lm.summary()
Out[106]:
OLS Regression Results
Dep. Variable: no_of_trips R-squared: 0.664
Model: OLS Adj. R-squared: 0.663
Method: Least Squares F-statistic: 542.0
Date: Wed, 15 Jun 2016 Prob (F-statistic): 0.00
Time: 00:21:44 Log-Likelihood: -1.0847e+05
No. Observations: 27790 AIC: 2.171e+05
Df Residuals: 27688 BIC: 2.180e+05
Df Model: 101
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 2.856e+12 3.28e+12 0.870 0.384 -3.57e+12 9.29e+12
neighborhood[T.Astoria Heights] -3.8192 0.882 -4.331 0.000 -5.548 -2.091
neighborhood[T.Battery Park City] 3.6901 0.719 5.133 0.000 2.281 5.099
neighborhood[T.Blissville] -3.1743 1.539 -2.063 0.039 -6.190 -0.159
neighborhood[T.Briarwood] -3.3259 2.072 -1.605 0.108 -7.386 0.735
neighborhood[T.Bushwick] 4.2675 2.043 2.089 0.037 0.263 8.272
neighborhood[T.Carnegie Hill] 21.7497 0.718 30.301 0.000 20.343 23.157
neighborhood[T.Central Harlem] -2.8198 1.122 -2.514 0.012 -5.018 -0.621
neighborhood[T.Chinatown] 0.3731 0.708 0.527 0.598 -1.015 1.761
neighborhood[T.Civic Center] 5.8944 0.703 8.390 0.000 4.517 7.271
neighborhood[T.Clason Point] -6.0031 8.516 -0.705 0.481 -22.695 10.689
neighborhood[T.Clinton] 29.5715 0.687 43.029 0.000 28.224 30.919
neighborhood[T.College Point] -6.8118 4.936 -1.380 0.168 -16.486 2.863
neighborhood[T.Downtown Flushing] 0.4975 2.293 0.217 0.828 -3.996 4.991
neighborhood[T.Dumbo] -3.1906 0.882 -3.619 0.000 -4.919 -1.463
neighborhood[T.East Elmhurst] 10.9531 0.737 14.865 0.000 9.509 12.397
neighborhood[T.East Harlem] 0.7767 0.695 1.117 0.264 -0.586 2.140
neighborhood[T.East Village] 13.0591 0.683 19.113 0.000 11.720 14.398
neighborhood[T.East Williamsburg] 1.7436 0.940 1.854 0.064 -0.100 3.587
neighborhood[T.Elmhurst] 1.0592 1.339 0.791 0.429 -1.565 3.684
neighborhood[T.Financial District] 7.7828 0.690 11.275 0.000 6.430 9.136
neighborhood[T.Flatiron] 39.1314 0.683 57.274 0.000 37.792 40.471
neighborhood[T.Forest Hills] -0.1925 1.166 -0.165 0.869 -2.477 2.092
neighborhood[T.Forest Hills Gardens] -2.6603 2.221 -1.198 0.231 -7.013 1.693
neighborhood[T.Fulton Ferry] -3.8747 1.328 -2.918 0.004 -6.477 -1.272
neighborhood[T.Glendale] -0.3806 6.031 -0.063 0.950 -12.202 11.441
neighborhood[T.Gramercy] 17.5919 0.684 25.719 0.000 16.251 18.933
neighborhood[T.Greenpoint] -0.2512 0.862 -0.291 0.771 -1.941 1.438
neighborhood[T.Greenwich Village] 20.0159 0.683 29.287 0.000 18.676 21.355
neighborhood[T.Hillcrest] -5.4708 12.029 -0.455 0.649 -29.048 18.107
neighborhood[T.Hunters Point] -2.1818 0.862 -2.531 0.011 -3.871 -0.492
neighborhood[T.Hunts Point] -6.6506 6.958 -0.956 0.339 -20.288 6.987
neighborhood[T.Jackson Heights] 2.5609 1.284 1.994 0.046 0.044 5.078
neighborhood[T.Jamaica Center] 4.2797 12.029 0.356 0.722 -19.298 27.858
neighborhood[T.Jamaica Hills] -2.5819 8.514 -0.303 0.762 -19.269 14.105
neighborhood[T.Kew Gardens] -2.3337 2.414 -0.967 0.334 -7.065 2.398
neighborhood[T.Kew Gardens Hills] -4.6563 2.806 -1.659 0.097 -10.157 0.844
neighborhood[T.Lefrak City] -0.1436 1.310 -0.110 0.913 -2.712 2.425
neighborhood[T.Lenox Hill] 23.6820 0.686 34.543 0.000 22.338 25.026
neighborhood[T.Lincoln Square] 43.2207 0.683 63.277 0.000 41.882 44.559
neighborhood[T.Little Italy] 5.3928 0.691 7.808 0.000 4.039 6.747
neighborhood[T.Long Island City] -1.1190 0.761 -1.471 0.141 -2.610 0.372
neighborhood[T.Lower East Side] -0.5483 0.726 -0.756 0.450 -1.970 0.874
neighborhood[T.Malba] -0.3084 8.514 -0.036 0.971 -16.996 16.379
neighborhood[T.Manhattan Valley] 6.6651 0.690 9.665 0.000 5.313 8.017
neighborhood[T.Maspeth] -3.5383 2.292 -1.544 0.123 -8.030 0.954
neighborhood[T.Middle Village] -1.0043 4.041 -0.249 0.804 -8.925 6.916
neighborhood[T.Midtown] 65.1421 0.683 95.346 0.000 63.803 66.481
neighborhood[T.Midtown South] 43.3258 0.683 63.431 0.000 41.987 44.665
neighborhood[T.Mott Haven] -3.3523 1.501 -2.234 0.026 -6.294 -0.410
neighborhood[T.Navy Yard] -7.6660 6.958 -1.102 0.271 -21.304 5.972
neighborhood[T.Noho] 18.9624 0.683 27.761 0.000 17.624 20.301
neighborhood[T.North Corona] 0.2271 1.994 0.114 0.909 -3.682 4.136
neighborhood[T.North Side] 2.6801 0.755 3.548 0.000 1.199 4.161
neighborhood[T.Pomonok] -5.7715 6.958 -0.829 0.407 -19.409 7.866
neighborhood[T.Port Morris] -4.0132 2.371 -1.693 0.091 -8.661 0.634
neighborhood[T.Queensboro Hill] -4.7002 2.962 -1.587 0.113 -10.505 1.105
neighborhood[T.Ravenswood] -0.2743 0.729 -0.377 0.707 -1.702 1.154
neighborhood[T.Rego Park] -2.3283 1.390 -1.675 0.094 -5.053 0.397
neighborhood[T.Ridgewood] 2.2109 2.961 0.747 0.455 -3.594 8.016
neighborhood[T.Roosevelt Island] -2.2769 0.748 -3.044 0.002 -3.743 -0.811
neighborhood[T.Soho] 4.6879 0.694 6.757 0.000 3.328 6.048
neighborhood[T.South Corona] -2.4331 2.615 -0.930 0.352 -7.558 2.692
neighborhood[T.South Side] -0.2785 0.790 -0.353 0.724 -1.826 1.270
neighborhood[T.Steinway] -0.2137 2.372 -0.090 0.928 -4.862 4.435
neighborhood[T.Stuyvesant Town] -1.3289 0.734 -1.810 0.070 -2.768 0.111
neighborhood[T.Sunnyside Gardens] -0.8351 0.784 -1.065 0.287 -2.372 0.702
neighborhood[T.Sutton Place] 40.9135 0.683 59.880 0.000 39.574 42.253
neighborhood[T.Tribeca] 4.4400 0.696 6.382 0.000 3.076 5.804
neighborhood[T.Tudor City] 5.8816 0.695 8.468 0.000 4.520 7.243
neighborhood[T.Turtle Bay] 19.7544 0.684 28.873 0.000 18.413 21.095
neighborhood[T.Upper East Side] 22.1726 0.691 32.111 0.000 20.819 23.526
neighborhood[T.Upper West Side] 22.6840 0.685 33.126 0.000 21.342 24.026
neighborhood[T.Vinegar Hill] -3.1477 2.331 -1.351 0.177 -7.716 1.420
neighborhood[T.West Village] 16.7904 0.684 24.554 0.000 15.450 18.131
neighborhood[T.Whitestone] 9.3752 12.030 0.779 0.436 -14.205 32.955
neighborhood[T.Williamsburg] 0.9205 1.251 0.736 0.462 -1.532 3.373
neighborhood[T.Woodside] -0.4182 0.851 -0.491 0.623 -2.086 1.250
neighborhood[T.Yorkville] 12.0526 0.688 17.524 0.000 10.704 13.401
hour_0 -2.856e+12 3.28e+12 -0.870 0.384 -9.29e+12 3.57e+12
hour_1 -2.856e+12 3.28e+12 -0.870 0.384 -9.29e+12 3.57e+12
hour_2 -2.856e+12 3.28e+12 -0.870 0.384 -9.29e+12 3.57e+12
hour_3 -2.856e+12 3.28e+12 -0.870 0.384 -9.29e+12 3.57e+12
hour_4 -2.856e+12 3.28e+12 -0.870 0.384 -9.29e+12 3.57e+12
hour_5 -2.856e+12 3.28e+12 -0.870 0.384 -9.29e+12 3.57e+12
hour_6 -2.856e+12 3.28e+12 -0.870 0.384 -9.29e+12 3.57e+12
hour_7 -2.856e+12 3.28e+12 -0.870 0.384 -9.29e+12 3.57e+12
hour_8 -2.856e+12 3.28e+12 -0.870 0.384 -9.29e+12 3.57e+12
hour_9 -2.856e+12 3.28e+12 -0.870 0.384 -9.29e+12 3.57e+12
hour_10 -2.856e+12 3.28e+12 -0.870 0.384 -9.29e+12 3.57e+12
hour_11 -2.856e+12 3.28e+12 -0.870 0.384 -9.29e+12 3.57e+12
hour_12 -2.856e+12 3.28e+12 -0.870 0.384 -9.29e+12 3.57e+12
hour_13 -2.856e+12 3.28e+12 -0.870 0.384 -9.29e+12 3.57e+12
hour_14 -2.856e+12 3.28e+12 -0.870 0.384 -9.29e+12 3.57e+12
hour_15 -2.856e+12 3.28e+12 -0.870 0.384 -9.29e+12 3.57e+12
hour_16 -2.856e+12 3.28e+12 -0.870 0.384 -9.29e+12 3.57e+12
hour_17 -2.856e+12 3.28e+12 -0.870 0.384 -9.29e+12 3.57e+12
hour_18 -2.856e+12 3.28e+12 -0.870 0.384 -9.29e+12 3.57e+12
hour_19 -2.856e+12 3.28e+12 -0.870 0.384 -9.29e+12 3.57e+12
hour_20 -2.856e+12 3.28e+12 -0.870 0.384 -9.29e+12 3.57e+12
hour_21 -2.856e+12 3.28e+12 -0.870 0.384 -9.29e+12 3.57e+12
hour_22 -2.856e+12 3.28e+12 -0.870 0.384 -9.29e+12 3.57e+12
hour_23 -2.856e+12 3.28e+12 -0.870 0.384 -9.29e+12 3.57e+12
Omnibus: 6086.828 Durbin-Watson: 0.433
Prob(Omnibus): 0.000 Jarque-Bera (JB): 48538.485
Skew: 0.836 Prob(JB): 0.00
Kurtosis: 9.255 Cond. No. 2.35e+14
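
Note that the intercept and hour coefficients above (±2.856e+12) and the condition number of 2.35e+14 signal perfect multicollinearity: the 24 hour dummies sum to one in every row, duplicating the intercept column (the dummy variable trap). One common remedy, sketched here rather than applied in this notebook, is to drop a reference level with `drop_first=True` in `pd.get_dummies`:

```python
import pandas as pd

# Toy pickup-hour column standing in for trip_count_with_neighborhood.pickup_hour
hours = pd.Series([0, 1, 2, 2, 23, 0])

# all levels: together with an intercept these columns are perfectly collinear
full = pd.get_dummies(hours, prefix='hour')
# drop_first=True removes hour_0, which becomes the baseline level
reduced = pd.get_dummies(hours, prefix='hour', drop_first=True)

print(full.shape, reduced.shape)
```

With `hour_0` dropped, each remaining hour coefficient is interpreted relative to midnight, and the condition number falls back to a sane range.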

Next, I perform an OLS fit with the neighborhood, pickup hour, and day of the week as predictors.

We observe only a slight increase over the previous model. One likely reason the pickup day is not a strong predictor is that all weekdays share similar patterns. A column that classifies each day as a weekday or a weekend day might work better.
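
The weekday/weekend idea can be sketched as a derived 0/1 column, assuming the day-name strings used in this notebook's `pickup_day` column:

```python
import pandas as pd

# Hypothetical slice of trip_count_with_neighborhood with just the day column
df = pd.DataFrame({'pickup_day': ['Monday', 'Saturday', 'Sunday', 'Friday']})

# 1 for weekend days, 0 for weekdays; usable as a single OLS predictor
df['is_weekend'] = df['pickup_day'].isin(['Saturday', 'Sunday']).astype(int)
print(df)
```

A single `is_weekend` indicator collapses the seven day levels into the one contrast that the coefficients above suggest matters most.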

In [107]:
lm = smf.ols(formula='no_of_trips ~ neighborhood + pickup_day + hour_0 + hour_1 + hour_2 + hour_3 + hour_4 + hour_5 + \
             hour_6 + hour_7 + hour_8 + hour_9 + hour_10 + hour_11 + hour_12 + hour_13 + hour_14 + hour_15 + \
             hour_16 + hour_17 + hour_18 + hour_19 + hour_20 + hour_21 + hour_22 + hour_23', data=trip_count_with_neighborhood).fit()
lm.summary()
Out[107]:
OLS Regression Results
Dep. Variable: no_of_trips R-squared: 0.667
Model: OLS Adj. R-squared: 0.666
Method: Least Squares F-statistic: 517.7
Date: Wed, 15 Jun 2016 Prob (F-statistic): 0.00
Time: 00:22:35 Log-Likelihood: -1.0835e+05
No. Observations: 27790 AIC: 2.169e+05
Df Residuals: 27682 BIC: 2.178e+05
Df Model: 107
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 3.129e+12 3.28e+12 0.954 0.340 -3.3e+12 9.55e+12
neighborhood[T.Astoria Heights] -3.7933 0.884 -4.292 0.000 -5.526 -2.061
neighborhood[T.Battery Park City] 3.7126 0.715 5.189 0.000 2.310 5.115
neighborhood[T.Blissville] -3.0397 1.535 -1.981 0.048 -6.048 -0.032
neighborhood[T.Briarwood] -3.1203 2.064 -1.512 0.131 -7.165 0.924
neighborhood[T.Bushwick] 4.2335 2.035 2.080 0.038 0.245 8.222
neighborhood[T.Carnegie Hill] 21.7782 0.714 30.483 0.000 20.378 23.179
neighborhood[T.Central Harlem] -2.8760 1.118 -2.572 0.010 -5.068 -0.684
neighborhood[T.Chinatown] 0.3496 0.705 0.496 0.620 -1.033 1.732
neighborhood[T.Civic Center] 5.8863 0.700 8.405 0.000 4.514 7.259
neighborhood[T.Clason Point] -6.4120 8.483 -0.756 0.450 -23.039 10.215
neighborhood[T.Clinton] 29.5997 0.685 43.207 0.000 28.257 30.942
neighborhood[T.College Point] -7.5442 4.917 -1.534 0.125 -17.182 2.093
neighborhood[T.Downtown Flushing] 0.6090 2.284 0.267 0.790 -3.868 5.085
neighborhood[T.Dumbo] -3.3148 0.878 -3.775 0.000 -5.036 -1.594
neighborhood[T.East Elmhurst] 10.9676 0.733 14.962 0.000 9.531 12.404
neighborhood[T.East Harlem] 0.7960 0.693 1.149 0.251 -0.562 2.154
neighborhood[T.East Village] 13.0809 0.681 19.220 0.000 11.747 14.415
neighborhood[T.East Williamsburg] 1.6612 0.937 1.774 0.076 -0.175 3.497
neighborhood[T.Elmhurst] 1.2178 1.334 0.913 0.361 -1.397 3.832
neighborhood[T.Financial District] 7.7849 0.688 11.322 0.000 6.437 9.133
neighborhood[T.Flatiron] 39.1471 0.681 57.517 0.000 37.813 40.481
neighborhood[T.Forest Hills] -0.2266 1.161 -0.195 0.845 -2.502 2.049
neighborhood[T.Forest Hills Gardens] -2.4274 2.213 -1.097 0.273 -6.764 1.909
neighborhood[T.Fulton Ferry] -4.1114 1.323 -3.108 0.002 -6.704 -1.519
neighborhood[T.Glendale] -1.2314 6.008 -0.205 0.838 -13.008 10.545
neighborhood[T.Gramercy] 17.6023 0.681 25.834 0.000 16.267 18.938
neighborhood[T.Greenpoint] -0.2900 0.859 -0.338 0.736 -1.973 1.393
neighborhood[T.Greenwich Village] 20.0319 0.681 29.424 0.000 18.697 21.366
neighborhood[T.Hillcrest] -6.1949 11.983 -0.517 0.605 -29.683 17.293
neighborhood[T.Hunters Point] -2.2556 0.859 -2.626 0.009 -3.939 -0.572
neighborhood[T.Hunts Point] -7.4354 6.931 -1.073 0.283 -21.021 6.150
neighborhood[T.Jackson Heights] 2.6161 1.280 2.045 0.041 0.108 5.124
neighborhood[T.Jamaica Center] 4.5378 11.983 0.379 0.705 -18.950 28.026
neighborhood[T.Jamaica Hills] -2.1507 8.481 -0.254 0.800 -18.774 14.473
neighborhood[T.Kew Gardens] -2.0826 2.405 -0.866 0.386 -6.796 2.631
neighborhood[T.Kew Gardens Hills] -4.7616 2.795 -1.703 0.089 -10.241 0.718
neighborhood[T.Lefrak City] -0.1036 1.305 -0.079 0.937 -2.662 2.455
neighborhood[T.Lenox Hill] 23.7036 0.683 34.708 0.000 22.365 25.042
neighborhood[T.Lincoln Square] 43.2395 0.680 63.548 0.000 41.906 44.573
neighborhood[T.Little Italy] 5.3958 0.688 7.842 0.000 4.047 6.744
neighborhood[T.Long Island City] -1.1854 0.758 -1.565 0.118 -2.670 0.299
neighborhood[T.Lower East Side] -0.5527 0.723 -0.765 0.444 -1.969 0.864
neighborhood[T.Malba] 0.7739 8.481 0.091 0.927 -15.850 17.398
neighborhood[T.Manhattan Valley] 6.6772 0.687 9.720 0.000 5.331 8.024
neighborhood[T.Maspeth] -3.7056 2.283 -1.623 0.105 -8.181 0.769
neighborhood[T.Middle Village] -1.5174 4.025 -0.377 0.706 -9.407 6.373
neighborhood[T.Midtown] 65.1606 0.681 95.738 0.000 63.827 66.495
neighborhood[T.Midtown South] 43.3443 0.680 63.702 0.000 42.011 44.678
neighborhood[T.Mott Haven] -3.2735 1.495 -2.189 0.029 -6.204 -0.343
neighborhood[T.Navy Yard] -7.7360 6.932 -1.116 0.264 -21.322 5.850
neighborhood[T.Noho] 18.9807 0.680 27.895 0.000 17.647 20.314
neighborhood[T.North Corona] 0.0901 1.986 0.045 0.964 -3.803 3.983
neighborhood[T.North Side] 2.6771 0.752 3.558 0.000 1.202 4.152
neighborhood[T.Pomonok] -5.6864 6.931 -0.820 0.412 -19.272 7.899
neighborhood[T.Port Morris] -3.9080 2.362 -1.655 0.098 -8.538 0.722
neighborhood[T.Queensboro Hill] -4.6150 2.950 -1.564 0.118 -10.397 1.167
neighborhood[T.Ravenswood] -0.2910 0.726 -0.401 0.688 -1.713 1.131
neighborhood[T.Rego Park] -2.4781 1.385 -1.789 0.074 -5.193 0.236
neighborhood[T.Ridgewood] 2.1576 2.950 0.731 0.465 -3.625 7.940
neighborhood[T.Roosevelt Island] -2.2768 0.745 -3.055 0.002 -3.737 -0.816
neighborhood[T.Soho] 4.7095 0.691 6.815 0.000 3.355 6.064
neighborhood[T.South Corona] -2.4911 2.605 -0.956 0.339 -7.597 2.614
neighborhood[T.South Side] -0.2631 0.787 -0.334 0.738 -1.805 1.279
neighborhood[T.Steinway] -0.0408 2.363 -0.017 0.986 -4.672 4.590
neighborhood[T.Stuyvesant Town] -1.3679 0.732 -1.870 0.062 -2.802 0.066
neighborhood[T.Sunnyside Gardens] -0.8068 0.781 -1.033 0.302 -2.338 0.724
neighborhood[T.Sutton Place] 40.9306 0.681 60.138 0.000 39.597 42.265
neighborhood[T.Tribeca] 4.4774 0.693 6.461 0.000 3.119 5.836
neighborhood[T.Tudor City] 5.8945 0.692 8.519 0.000 4.538 7.251
neighborhood[T.Turtle Bay] 19.7688 0.682 29.005 0.000 18.433 21.105
neighborhood[T.Upper East Side] 22.1930 0.688 32.265 0.000 20.845 23.541
neighborhood[T.Upper West Side] 22.7071 0.682 33.288 0.000 21.370 24.044
neighborhood[T.Vinegar Hill] -2.9732 2.322 -1.281 0.200 -7.524 1.578
neighborhood[T.West Village] 16.8065 0.681 24.673 0.000 15.471 18.142
neighborhood[T.Whitestone] 8.3277 11.984 0.695 0.487 -15.163 31.818
neighborhood[T.Williamsburg] 0.7862 1.246 0.631 0.528 -1.657 3.229
neighborhood[T.Woodside] -0.3705 0.848 -0.437 0.662 -2.032 1.291
neighborhood[T.Yorkville] 12.0739 0.685 17.623 0.000 10.731 13.417
pickup_day[T.Monday] -3.4162 0.284 -12.026 0.000 -3.973 -2.859
pickup_day[T.Saturday] -0.4181 0.278 -1.502 0.133 -0.964 0.128
pickup_day[T.Sunday] -2.5467 0.279 -9.127 0.000 -3.094 -2.000
pickup_day[T.Thursday] -0.7751 0.268 -2.889 0.004 -1.301 -0.249
pickup_day[T.Tuesday] -1.7351 0.268 -6.477 0.000 -2.260 -1.210
pickup_day[T.Wednesday] -1.7050 0.269 -6.338 0.000 -2.232 -1.178
hour_0 -3.129e+12 3.28e+12 -0.954 0.340 -9.55e+12 3.3e+12
hour_1 -3.129e+12 3.28e+12 -0.954 0.340 -9.55e+12 3.3e+12
hour_2 -3.129e+12 3.28e+12 -0.954 0.340 -9.55e+12 3.3e+12
hour_3 -3.129e+12 3.28e+12 -0.954 0.340 -9.55e+12 3.3e+12
hour_4 -3.129e+12 3.28e+12 -0.954 0.340 -9.55e+12 3.3e+12
hour_5 -3.129e+12 3.28e+12 -0.954 0.340 -9.55e+12 3.3e+12
hour_6 -3.129e+12 3.28e+12 -0.954 0.340 -9.55e+12 3.3e+12
hour_7 -3.129e+12 3.28e+12 -0.954 0.340 -9.55e+12 3.3e+12
hour_8 -3.129e+12 3.28e+12 -0.954 0.340 -9.55e+12 3.3e+12
hour_9 -3.129e+12 3.28e+12 -0.954 0.340 -9.55e+12 3.3e+12
hour_10 -3.129e+12 3.28e+12 -0.954 0.340 -9.55e+12 3.3e+12
hour_11 -3.129e+12 3.28e+12 -0.954 0.340 -9.55e+12 3.3e+12
hour_12 -3.129e+12 3.28e+12 -0.954 0.340 -9.55e+12 3.3e+12
hour_13 -3.129e+12 3.28e+12 -0.954 0.340 -9.55e+12 3.3e+12
hour_14 -3.129e+12 3.28e+12 -0.954 0.340 -9.55e+12 3.3e+12
hour_15 -3.129e+12 3.28e+12 -0.954 0.340 -9.55e+12 3.3e+12
hour_16 -3.129e+12 3.28e+12 -0.954 0.340 -9.55e+12 3.3e+12
hour_17 -3.129e+12 3.28e+12 -0.954 0.340 -9.55e+12 3.3e+12
hour_18 -3.129e+12 3.28e+12 -0.954 0.340 -9.55e+12 3.3e+12
hour_19 -3.129e+12 3.28e+12 -0.954 0.340 -9.55e+12 3.3e+12
hour_20 -3.129e+12 3.28e+12 -0.954 0.340 -9.55e+12 3.3e+12
hour_21 -3.129e+12 3.28e+12 -0.954 0.340 -9.55e+12 3.3e+12
hour_22 -3.129e+12 3.28e+12 -0.954 0.340 -9.55e+12 3.3e+12
hour_23 -3.129e+12 3.28e+12 -0.954 0.340 -9.55e+12 3.3e+12
Omnibus: 6172.173 Durbin-Watson: 0.438
Prob(Omnibus): 0.000 Jarque-Bera (JB): 50233.069
Skew: 0.844 Prob(JB): 0.00
Kurtosis: 9.366 Cond. No. 2.49e+14

I now introduce dummy variables for the neighborhoods. This is required if we want to fit a linear model in scikit-learn, whose estimators expect numeric inputs.

In [108]:
neighborhood_dummies = pd.get_dummies(trip_count_with_neighborhood.neighborhood, prefix = 'Nbr')
In [109]:
frames = [trip_count_with_neighborhood, neighborhood_dummies]
trip_count_all_dummies = pd.concat(frames, axis=1)
print trip_count_all_dummies.shape
trip_count_all_dummies.head()
(27790, 112)
Out[109]:
full_date pickup_hour pickup_day month_day Borough longitude latitude no_of_trips neighborhood hour_0 ... Nbr_Tudor City Nbr_Turtle Bay Nbr_Upper East Side Nbr_Upper West Side Nbr_Vinegar Hill Nbr_West Village Nbr_Whitestone Nbr_Williamsburg Nbr_Woodside Nbr_Yorkville
0 2013-01-01 0 Tuesday 1 Queens -73.915654 40.768509 2 Astoria 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 2013-01-01 1 Tuesday 1 Queens -73.915654 40.768509 6 Astoria 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 2013-01-01 2 Tuesday 1 Queens -73.915654 40.768509 7 Astoria 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 2013-01-01 3 Tuesday 1 Queens -73.915654 40.768509 10 Astoria 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 2013-01-01 4 Tuesday 1 Queens -73.915654 40.768509 7 Astoria 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 112 columns

The following function takes the input features, the target, and the regression model to be used. It prints the mean squared error and R-squared value, plots a histogram of the residuals, and returns the fitted model. It is called below with various predictor sets to build lasso, ridge, and elastic net regression models.

In [110]:
def get_linear_model_metrics(X, y, algo):
    # p-values of each feature given y (the f-statistics are ignored here)
    pvals = feature_selection.f_regression(X, y)[1]
    # fit the supplied regression model on X and y
    algo.fit(X, y)
    residuals = (y - algo.predict(X)).values
    print residuals
    print 'Mean Squared Error:', (sum(residuals ** 2)) / len(y)
    print 'R-Squared:', algo.score(X, y)

    # histogram of the residuals; plt.hist requires an integer bin count
    plt.figure()
    plt.hist(residuals, bins=int(np.ceil(np.sqrt(len(y)))))

    # return the fitted model
    return algo

For the scikit-learn linear models, I exclude all columns except the neighborhood and pickup hour dummy variables.

Ridge regression gives us the best fit.

In [131]:
X = trip_count_all_dummies.drop(['full_date','pickup_day','pickup_hour','no_of_trips','Borough','longitude','latitude','neighborhood','month_day'], axis=1)
y = trip_count_all_dummies['no_of_trips']

estimators = [
    linear_model.Lasso(),
    linear_model.Ridge(),
    linear_model.ElasticNet(),
]

for est in estimators:
    print est
    get_linear_model_metrics(X, y, est)
    print
Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)
[-14.03065035 -10.03065035  -9.03065035 ...,  -4.03065035  -5.03065035
  -6.03065035]
Mean Squared Error: 394.726523576
R-Squared: 0.077837409814

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)
[  2.06658534   9.27722554  12.90518353 ...,  -8.05063394  -8.16124178
  -5.62567444]
Mean Squared Error: 143.784765975
R-Squared: 0.664089124239

ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.5,
      max_iter=1000, normalize=False, positive=False, precompute=False,
      random_state=None, selection='cyclic', tol=0.0001, warm_start=False)
[-14.31399405 -10.31399405  -9.31399405 ...,  -4.31399405  -5.31399405
  -6.31399405]
Mean Squared Error: 420.885774516
R-Squared: 0.0167240030278

Combining the weather data

I extract the weather data from the website www.wunderground.com. Subsequently, I join the weather data with the trip count data and use the weather parameters as additional predictors.

In [112]:
weather_df = pd.read_csv('https://www.wunderground.com/history/airport/KNYC/2013/1/1/MonthlyHistory.html?&reqdb.zip=&reqdb.magic=&reqdb.wmo=&MR=1&format=1')
weather_df.head()
Out[112]:
EST Max TemperatureF Mean TemperatureF Min TemperatureF Max Dew PointF MeanDew PointF Min DewpointF Max Humidity Mean Humidity Min Humidity ... Max VisibilityMiles Mean VisibilityMiles Min VisibilityMiles Max Wind SpeedMPH Mean Wind SpeedMPH Max Gust SpeedMPH PrecipitationIn CloudCover Events WindDirDegrees<br />
0 2013-1-1 40 33 26 27 22 9 64 54 44 ... 10 10 10 15 7 26 0.00 5 NaN 283<br />
1 2013-1-2 33 28 22 15 11 9 57 48 39 ... 10 10 10 15 6 22 0.00 0 NaN 299<br />
2 2013-1-3 32 28 24 17 14 12 68 56 43 ... 10 10 10 13 5 20 0.00 2 NaN 272<br />
3 2013-1-4 37 34 30 21 19 18 63 56 48 ... 10 10 10 18 8 28 0.00 3 NaN 268<br />
4 2013-1-5 42 37 32 21 19 16 56 48 39 ... 10 10 10 17 7 26 0.00 1 NaN 285<br />

5 rows × 23 columns

In [113]:
weather_df.columns
Out[113]:
Index([u'EST', u'Max TemperatureF', u'Mean TemperatureF', u'Min TemperatureF',
       u'Max Dew PointF', u'MeanDew PointF', u'Min DewpointF', u'Max Humidity',
       u' Mean Humidity', u' Min Humidity', u' Max Sea Level PressureIn',
       u' Mean Sea Level PressureIn', u' Min Sea Level PressureIn',
       u' Max VisibilityMiles', u' Mean VisibilityMiles',
       u' Min VisibilityMiles', u' Max Wind SpeedMPH', u' Mean Wind SpeedMPH',
       u' Max Gust SpeedMPH', u'PrecipitationIn', u' CloudCover', u' Events',
       u' WindDirDegrees<br />'],
      dtype='object')
In [114]:
weather_df.rename(columns={'Max TemperatureF': 'Max_TemperatureF', 'Mean TemperatureF': 'Mean_TemperatureF', 
                          'Min TemperatureF':'Min_TemperatureF', 'Max Dew PointF':'Max_Dew_PointF',
                          'MeanDew PointF':'MeanDew_PointF', 'Min DewpointF':'Min_DewpointF','Max Humidity':'Max_Humidity',
                           ' Mean Humidity':'Mean_Humidity',' Min Humidity':'Min_Humidity', 
                           ' Max Sea Level PressureIn':'Max_Sea_Level_PressureIn',' Mean Sea Level PressureIn':'Mean_Sea_Level_PressureIn',
                           ' Min Sea Level PressureIn':'Min_Sea_Level_PressureIn',' Max VisibilityMiles':'Max_VisibilityMiles', 
                           ' Mean VisibilityMiles':'Mean_VisibilityMiles',' Min VisibilityMiles':'Min_VisibilityMiles',
                           ' Max Wind SpeedMPH':'Max_Wind_SpeedMPH',' Mean Wind SpeedMPH':'Mean_Wind_SpeedMPH',
                           ' Max Gust SpeedMPH':'Max_Gust_SpeedMPH','PrecipitationIn':'PrecipitationIn', ' CloudCover':'CloudCover', ' Events':'Events',
                           ' WindDirDegrees<br />':'WindDirDegrees'}, inplace=True)
In [115]:
weather_df.columns
Out[115]:
Index([u'EST', u'Max_TemperatureF', u'Mean_TemperatureF', u'Min_TemperatureF',
       u'Max_Dew_PointF', u'MeanDew_PointF', u'Min_DewpointF', u'Max_Humidity',
       u'Mean_Humidity', u'Min_Humidity', u'Max_Sea_Level_PressureIn',
       u'Mean_Sea_Level_PressureIn', u'Min_Sea_Level_PressureIn',
       u'Max_VisibilityMiles', u'Mean_VisibilityMiles', u'Min_VisibilityMiles',
       u'Max_Wind_SpeedMPH', u'Mean_Wind_SpeedMPH', u'Max_Gust_SpeedMPH',
       u'PrecipitationIn', u'CloudCover', u'Events', u'WindDirDegrees'],
      dtype='object')
In [116]:
weather_df['WindDirDegrees'] = weather_df['WindDirDegrees'].apply(lambda x: x.split('<')[0]) #Remove the trailing '<br />' from the 'WindDirDegrees' column
In [117]:
weather_df.head()
Out[117]:
EST Max_TemperatureF Mean_TemperatureF Min_TemperatureF Max_Dew_PointF MeanDew_PointF Min_DewpointF Max_Humidity Mean_Humidity Min_Humidity ... Max_VisibilityMiles Mean_VisibilityMiles Min_VisibilityMiles Max_Wind_SpeedMPH Mean_Wind_SpeedMPH Max_Gust_SpeedMPH PrecipitationIn CloudCover Events WindDirDegrees
0 2013-1-1 40 33 26 27 22 9 64 54 44 ... 10 10 10 15 7 26 0.00 5 NaN 283
1 2013-1-2 33 28 22 15 11 9 57 48 39 ... 10 10 10 15 6 22 0.00 0 NaN 299
2 2013-1-3 32 28 24 17 14 12 68 56 43 ... 10 10 10 13 5 20 0.00 2 NaN 272
3 2013-1-4 37 34 30 21 19 18 63 56 48 ... 10 10 10 18 8 28 0.00 3 NaN 268
4 2013-1-5 42 37 32 21 19 16 56 48 39 ... 10 10 10 17 7 26 0.00 1 NaN 285

5 rows × 23 columns

In [119]:
print weather_df['Events'].unique()
print weather_df['WindDirDegrees'].unique()
print weather_df['PrecipitationIn'].unique()
[nan 'Rain' 'Fog' 'Fog-Rain' 'Snow' 'Rain-Snow']
['283' '299' '272' '268' '285' '257' '291' '251' '265' '304' '76' '43'
 '160' '316' '38' '356' '318' '323' '237' '270' '297' '292' '311' '284'
 '303' '289' '233' '119' '121']
['0.00' 'T' '0.55' '0.02' '0.09' '0.12' '0.69' '0.07' '0.22' '0.06' '0.04'
 '0.90']
In [120]:
weather_df.loc[(weather_df['PrecipitationIn'] == 'T'),'PrecipitationIn'] = 0 #Replace 'T' (trace precipitation) with 0
weather_df['PrecipitationIn'] = weather_df['PrecipitationIn'].astype(float)
In [121]:
weather_df['date'] = weather_df['EST'].apply(lambda x: time.strptime(x, "%Y-%m-%d").tm_mday) #Extract the day of the month from the date string
In [123]:
weather_df.head()
Out[123]:
EST Max_TemperatureF Mean_TemperatureF Min_TemperatureF Max_Dew_PointF MeanDew_PointF Min_DewpointF Max_Humidity Mean_Humidity Min_Humidity ... Mean_VisibilityMiles Min_VisibilityMiles Max_Wind_SpeedMPH Mean_Wind_SpeedMPH Max_Gust_SpeedMPH PrecipitationIn CloudCover Events WindDirDegrees date
0 2013-1-1 40 33 26 27 22 9 64 54 44 ... 10 10 15 7 26 0.0 5 NaN 283 1
1 2013-1-2 33 28 22 15 11 9 57 48 39 ... 10 10 15 6 22 0.0 0 NaN 299 2
2 2013-1-3 32 28 24 17 14 12 68 56 43 ... 10 10 13 5 20 0.0 2 NaN 272 3
3 2013-1-4 37 34 30 21 19 18 63 56 48 ... 10 10 18 8 28 0.0 3 NaN 268 4
4 2013-1-5 42 37 32 21 19 16 56 48 39 ... 10 10 17 7 26 0.0 1 NaN 285 5

5 rows × 24 columns

In [122]:
trip_count_all_dummies.head()
Out[122]:
full_date pickup_hour pickup_day month_day Borough longitude latitude no_of_trips neighborhood hour_0 ... Nbr_Tudor City Nbr_Turtle Bay Nbr_Upper East Side Nbr_Upper West Side Nbr_Vinegar Hill Nbr_West Village Nbr_Whitestone Nbr_Williamsburg Nbr_Woodside Nbr_Yorkville
0 2013-01-01 0 Tuesday 1 Queens -73.915654 40.768509 2 Astoria 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 2013-01-01 1 Tuesday 1 Queens -73.915654 40.768509 6 Astoria 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 2013-01-01 2 Tuesday 1 Queens -73.915654 40.768509 7 Astoria 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 2013-01-01 3 Tuesday 1 Queens -73.915654 40.768509 10 Astoria 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 2013-01-01 4 Tuesday 1 Queens -73.915654 40.768509 7 Astoria 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 112 columns

I now combine the trip data with the weather data by joining on the weather dataframe's 'date' column, which corresponds to 'month_day' in the trip_count_all_dummies dataframe.
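The same day-keyed join can also be expressed with pd.merge; below is a minimal sketch on toy frames (the column values are illustrative, not taken from the real data):

```python
import pandas as pd

# Toy frames standing in for trip_count_all_dummies and weather_df
trips = pd.DataFrame({'month_day': [1, 1, 2], 'no_of_trips': [2, 6, 7]})
weather = pd.DataFrame({'date': [1, 2], 'PrecipitationIn': [0.0, 0.3]})

# Left-join every trip row to its day's weather record
combined = trips.merge(weather, left_on='month_day', right_on='date', how='left')
print(combined[['month_day', 'no_of_trips', 'PrecipitationIn']])
```

Using how='left' keeps every trip row even if a day had no weather record, which mirrors the behavior of set_index + join above.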

In [124]:
trip_and_weather_df = trip_count_all_dummies.set_index('month_day').join(weather_df.set_index('date'))
trip_and_weather_df['month_day'] = trip_and_weather_df.index.values
trip_and_weather_df.reset_index(drop=True, inplace=True)
trip_and_weather_df.head()
Out[124]:
full_date pickup_hour pickup_day Borough longitude latitude no_of_trips neighborhood hour_0 hour_1 ... Mean_VisibilityMiles Min_VisibilityMiles Max_Wind_SpeedMPH Mean_Wind_SpeedMPH Max_Gust_SpeedMPH PrecipitationIn CloudCover Events WindDirDegrees month_day
0 2013-01-01 0 Tuesday Queens -73.915654 40.768509 2 Astoria 1.0 0.0 ... 10 10 15 7 26 0.0 5 NaN 283 1
1 2013-01-01 1 Tuesday Queens -73.915654 40.768509 6 Astoria 0.0 1.0 ... 10 10 15 7 26 0.0 5 NaN 283 1
2 2013-01-01 2 Tuesday Queens -73.915654 40.768509 7 Astoria 0.0 0.0 ... 10 10 15 7 26 0.0 5 NaN 283 1
3 2013-01-01 3 Tuesday Queens -73.915654 40.768509 10 Astoria 0.0 0.0 ... 10 10 15 7 26 0.0 5 NaN 283 1
4 2013-01-01 4 Tuesday Queens -73.915654 40.768509 7 Astoria 0.0 0.0 ... 10 10 15 7 26 0.0 5 NaN 283 1

5 rows × 135 columns

In [126]:
trip_and_weather_df.columns.values
Out[126]:
array(['full_date', 'pickup_hour', 'pickup_day', 'Borough', 'longitude',
       'latitude', 'no_of_trips', 'neighborhood', 'hour_0', 'hour_1',
       'hour_2', 'hour_3', 'hour_4', 'hour_5', 'hour_6', 'hour_7',
       'hour_8', 'hour_9', 'hour_10', 'hour_11', 'hour_12', 'hour_13',
       'hour_14', 'hour_15', 'hour_16', 'hour_17', 'hour_18', 'hour_19',
       'hour_20', 'hour_21', 'hour_22', 'hour_23', 'Nbr_Astoria',
       'Nbr_Astoria Heights', 'Nbr_Battery Park City', 'Nbr_Blissville',
       'Nbr_Briarwood', 'Nbr_Bushwick', 'Nbr_Carnegie Hill',
       'Nbr_Central Harlem', 'Nbr_Chinatown', 'Nbr_Civic Center',
       'Nbr_Clason Point', 'Nbr_Clinton', 'Nbr_College Point',
       'Nbr_Downtown Flushing', 'Nbr_Dumbo', 'Nbr_East Elmhurst',
       'Nbr_East Harlem', 'Nbr_East Village', 'Nbr_East Williamsburg',
       'Nbr_Elmhurst', 'Nbr_Financial District', 'Nbr_Flatiron',
       'Nbr_Forest Hills', 'Nbr_Forest Hills Gardens', 'Nbr_Fulton Ferry',
       'Nbr_Glendale', 'Nbr_Gramercy', 'Nbr_Greenpoint',
       'Nbr_Greenwich Village', 'Nbr_Hillcrest', 'Nbr_Hunters Point',
       'Nbr_Hunts Point', 'Nbr_Jackson Heights', 'Nbr_Jamaica Center',
       'Nbr_Jamaica Hills', 'Nbr_Kew Gardens', 'Nbr_Kew Gardens Hills',
       'Nbr_Lefrak City', 'Nbr_Lenox Hill', 'Nbr_Lincoln Square',
       'Nbr_Little Italy', 'Nbr_Long Island City', 'Nbr_Lower East Side',
       'Nbr_Malba', 'Nbr_Manhattan Valley', 'Nbr_Maspeth',
       'Nbr_Middle Village', 'Nbr_Midtown', 'Nbr_Midtown South',
       'Nbr_Mott Haven', 'Nbr_Navy Yard', 'Nbr_Noho', 'Nbr_North Corona',
       'Nbr_North Side', 'Nbr_Pomonok', 'Nbr_Port Morris',
       'Nbr_Queensboro Hill', 'Nbr_Ravenswood', 'Nbr_Rego Park',
       'Nbr_Ridgewood', 'Nbr_Roosevelt Island', 'Nbr_Soho',
       'Nbr_South Corona', 'Nbr_South Side', 'Nbr_Steinway',
       'Nbr_Stuyvesant Town', 'Nbr_Sunnyside Gardens', 'Nbr_Sutton Place',
       'Nbr_Tribeca', 'Nbr_Tudor City', 'Nbr_Turtle Bay',
       'Nbr_Upper East Side', 'Nbr_Upper West Side', 'Nbr_Vinegar Hill',
       'Nbr_West Village', 'Nbr_Whitestone', 'Nbr_Williamsburg',
       'Nbr_Woodside', 'Nbr_Yorkville', 'EST', 'Max_TemperatureF',
       'Mean_TemperatureF', 'Min_TemperatureF', 'Max_Dew_PointF',
       'MeanDew_PointF', 'Min_DewpointF', 'Max_Humidity', 'Mean_Humidity',
       'Min_Humidity', 'Max_Sea_Level_PressureIn',
       'Mean_Sea_Level_PressureIn', 'Min_Sea_Level_PressureIn',
       'Max_VisibilityMiles', 'Mean_VisibilityMiles',
       'Min_VisibilityMiles', 'Max_Wind_SpeedMPH', 'Mean_Wind_SpeedMPH',
       'Max_Gust_SpeedMPH', 'PrecipitationIn', 'CloudCover', 'Events',
       'WindDirDegrees', 'month_day'], dtype=object)
In [127]:
sns.heatmap(trip_and_weather_df[['Max_TemperatureF', 'Mean_TemperatureF', 'Min_TemperatureF',\
       'Max_Dew_PointF', 'MeanDew_PointF', 'Min_DewpointF', 'Max_Humidity',\
       'Mean_Humidity', 'Min_Humidity', 'Max_Sea_Level_PressureIn',\
       'Mean_Sea_Level_PressureIn', 'Min_Sea_Level_PressureIn',\
       'Max_VisibilityMiles', 'Mean_VisibilityMiles',\
       'Min_VisibilityMiles', 'Max_Wind_SpeedMPH', 'Mean_Wind_SpeedMPH',\
       'Max_Gust_SpeedMPH', 'PrecipitationIn', 'CloudCover',\
       'WindDirDegrees']].corr())
Out[127]:
<matplotlib.axes._subplots.AxesSubplot at 0x2d6b03b38>

The heatmap of the weather data shows that attributes such as the max and min temperatures, max and min humidity, and max and min sea-level pressures are strongly correlated with their corresponding mean values. As predictors, I therefore keep only the means of these weather attributes; all the max and min columns are dropped when fitting the linear models in scikit-learn and statsmodels.
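This kind of pruning can also be automated. The helper below is a hypothetical sketch (not what the notebook actually does): for every pair of columns whose absolute correlation exceeds a threshold, it drops the later column of the pair.

```python
import numpy as np
import pandas as pd

# Illustrative helper: drop the later column of every highly correlated pair
def drop_highly_correlated(df, threshold=0.9):
    corr = df.corr().abs()
    # keep only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(to_drop, axis=1)

# Tiny synthetic example: b is nearly a copy of a, c is independent noise
rng = np.random.RandomState(0)
a = rng.rand(100)
demo = pd.DataFrame({'a': a, 'b': a + rng.rand(100) * 0.01, 'c': rng.rand(100)})
reduced = drop_highly_correlated(demo)
print(list(reduced.columns))
```

On the weather columns this would drop the max/min variants in favor of whichever of each correlated group appears first, so manual selection of the means (as done here) gives more interpretable results.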

In [134]:
X = trip_and_weather_df.drop(['full_date','pickup_day','pickup_hour','no_of_trips','Borough','longitude','latitude',\
                              'neighborhood','month_day','EST','Events','Max_TemperatureF','Min_TemperatureF','Max_Dew_PointF',\
                              'Min_DewpointF','Max_Humidity','Min_Humidity','Max_Sea_Level_PressureIn','Min_Sea_Level_PressureIn',\
                              'Max_VisibilityMiles','Min_VisibilityMiles','Max_Wind_SpeedMPH'], axis=1)
y = trip_and_weather_df['no_of_trips']

estimators = [
    linear_model.Lasso(),
    linear_model.Ridge(),
    linear_model.ElasticNet(),
]

for est in estimators:
    print est
    get_linear_model_metrics(X, y, est)
    print
Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)
[-13.79953573  -9.79953573  -8.79953573 ...,  -4.33230089  -5.33230089
  -6.33230089]
Mean Squared Error: 394.158334267
R-Squared: 0.0791648172553

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)
[  3.53152498  10.70624067  14.33421039 ...,  -9.66544016  -9.7766061
  -7.24979329]
Mean Squared Error: 142.386189653
R-Squared: 0.667356487049

ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.5,
      max_iter=1000, normalize=False, positive=False, precompute=False,
      random_state=None, selection='cyclic', tol=0.0001, warm_start=False)
[-13.87091042  -9.87091042  -8.87091042 ...,  -4.8660534   -5.8660534
  -6.8660534 ]
Mean Squared Error: 420.207499437
R-Squared: 0.0183085935392

Once again, Ridge Regression gives us the best fit.
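Each of the estimators above uses its default regularization strength (alpha=1.0). A sketch of letting cross-validation choose alpha instead, via RidgeCV, on synthetic data standing in for the real design matrix:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic stand-in for the dummy-variable design matrix
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 5)
y_demo = X_demo.dot(np.array([3.0, 0.0, -2.0, 1.0, 0.0])) + rng.randn(200) * 0.1

# RidgeCV scores each candidate alpha by (generalized) cross-validation
# and stores the winner in model.alpha_
model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(X_demo, y_demo)
print(model.alpha_)
print(model.score(X_demo, y_demo))
```

Tuning alpha this way could change which of Lasso, Ridge, and ElasticNet fits best, since the defaults penalize each model differently.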

I also perform another linear model fit with statsmodels, which gives results similar to the Ridge regression fit in scikit-learn.

In [135]:
lm = smf.ols(formula='no_of_trips ~ neighborhood + pickup_day + hour_0 + hour_1 + hour_2 + hour_3 + hour_4 + hour_5 + \
             hour_6 + hour_7 + hour_8 + hour_9 + hour_10 + hour_11 + hour_12 + hour_13 + hour_14 + hour_15 + \
             hour_16 + hour_17 + hour_18 + hour_19 + hour_20 + hour_21 + hour_22 + hour_23 + \
             Mean_TemperatureF + MeanDew_PointF + Mean_Humidity + Mean_Sea_Level_PressureIn + Mean_VisibilityMiles + \
             Mean_Wind_SpeedMPH + Max_Gust_SpeedMPH + PrecipitationIn + CloudCover', data=trip_and_weather_df).fit()
lm.summary()
Out[135]:
OLS Regression Results
Dep. Variable: no_of_trips R-squared: 0.669
Model: OLS Adj. R-squared: 0.667
Method: Least Squares F-statistic: 481.4
Date: Wed, 15 Jun 2016 Prob (F-statistic): 0.00
Time: 01:07:09 Log-Likelihood: -1.0828e+05
No. Observations: 27790 AIC: 2.168e+05
Df Residuals: 27673 BIC: 2.178e+05
Df Model: 116
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept -133.3154 20.294 -6.569 0.000 -173.092 -93.538
neighborhood[T.Astoria Heights] -3.5950 0.830 -4.329 0.000 -5.223 -1.967
neighborhood[T.Battery Park City] 3.5699 0.698 5.118 0.000 2.203 4.937
neighborhood[T.Blissville] -3.2335 1.510 -2.142 0.032 -6.192 -0.275
neighborhood[T.Briarwood] -3.1801 2.058 -1.546 0.122 -7.213 0.853
neighborhood[T.Bushwick] 4.5344 2.008 2.258 0.024 0.598 8.471
neighborhood[T.Carnegie Hill] 21.5885 0.682 31.659 0.000 20.252 22.925
neighborhood[T.Central Harlem] -2.7562 1.099 -2.508 0.012 -4.910 -0.602
neighborhood[T.Chinatown] 0.3913 0.703 0.557 0.578 -0.986 1.768
neighborhood[T.Civic Center] 6.0103 0.684 8.785 0.000 4.669 7.351
neighborhood[T.Clason Point] -6.8035 8.461 -0.804 0.421 -23.388 9.781
neighborhood[T.Clinton] 29.5181 0.679 43.497 0.000 28.188 30.848
neighborhood[T.College Point] -7.1763 4.903 -1.464 0.143 -16.787 2.434
neighborhood[T.Downtown Flushing] 0.6422 2.278 0.282 0.778 -3.822 5.106
neighborhood[T.Dumbo] -3.1612 0.853 -3.705 0.000 -4.834 -1.489
neighborhood[T.East Elmhurst] 11.1326 0.708 15.726 0.000 9.745 12.520
neighborhood[T.East Harlem] 0.7964 0.691 1.153 0.249 -0.558 2.151
neighborhood[T.East Village] 13.0824 0.679 19.273 0.000 11.752 14.413
neighborhood[T.East Williamsburg] 1.7524 0.934 1.876 0.061 -0.079 3.584
neighborhood[T.Elmhurst] 1.1816 1.331 0.888 0.375 -1.426 3.790
neighborhood[T.Financial District] 7.7830 0.686 11.349 0.000 6.439 9.127
neighborhood[T.Flatiron] 39.1504 0.679 57.675 0.000 37.820 40.481
neighborhood[T.Forest Hills] -0.2112 1.158 -0.182 0.855 -2.481 2.059
neighborhood[T.Forest Hills Gardens] -2.5263 2.207 -1.145 0.252 -6.852 1.800
neighborhood[T.Fulton Ferry] -4.1311 1.319 -3.131 0.002 -6.717 -1.545
neighborhood[T.Glendale] -0.8960 5.993 -0.150 0.881 -12.643 10.851
neighborhood[T.Gramercy] 17.6051 0.680 25.906 0.000 16.273 18.937
neighborhood[T.Greenpoint] -0.3191 0.856 -0.373 0.709 -1.998 1.360
neighborhood[T.Greenwich Village] 20.0350 0.679 29.507 0.000 18.704 21.366
neighborhood[T.Hillcrest] -7.3553 11.956 -0.615 0.538 -30.789 16.078
neighborhood[T.Hunters Point] -2.2377 0.857 -2.612 0.009 -3.917 -0.559
neighborhood[T.Hunts Point] -6.3357 6.914 -0.916 0.359 -19.888 7.216
neighborhood[T.Jackson Heights] 2.6530 1.276 2.079 0.038 0.152 5.155
neighborhood[T.Jamaica Center] 6.1520 11.955 0.515 0.607 -17.280 29.584
neighborhood[T.Jamaica Hills] -2.2782 8.460 -0.269 0.788 -18.861 14.304
neighborhood[T.Kew Gardens] -1.9779 2.399 -0.825 0.410 -6.680 2.724
neighborhood[T.Kew Gardens Hills] -4.7236 2.788 -1.694 0.090 -10.189 0.741
neighborhood[T.Lefrak City] -0.0144 1.302 -0.011 0.991 -2.567 2.538
neighborhood[T.Lenox Hill] 23.7073 0.681 34.806 0.000 22.372 25.042
neighborhood[T.Lincoln Square] 43.2426 0.679 63.721 0.000 41.912 44.573
neighborhood[T.Little Italy] 5.3952 0.686 7.862 0.000 4.050 6.740
neighborhood[T.Long Island City] -1.2455 0.756 -1.648 0.099 -2.727 0.236
neighborhood[T.Lower East Side] -0.5517 0.721 -0.765 0.444 -1.965 0.861
neighborhood[T.Malba] 0.6172 8.461 0.073 0.942 -15.966 17.200
neighborhood[T.Manhattan Valley] 6.6933 0.685 9.769 0.000 5.350 8.036
neighborhood[T.Maspeth] -3.4271 2.278 -1.505 0.132 -7.892 1.037
neighborhood[T.Middle Village] -1.3449 4.016 -0.335 0.738 -9.217 6.527
neighborhood[T.Midtown] 65.1632 0.679 95.997 0.000 63.833 66.494
neighborhood[T.Midtown South] 43.3474 0.679 63.876 0.000 42.017 44.678
neighborhood[T.Mott Haven] -3.2039 1.491 -2.148 0.032 -6.127 -0.281
neighborhood[T.Navy Yard] -7.9482 6.916 -1.149 0.250 -21.503 5.607
neighborhood[T.Noho] 18.9845 0.679 27.975 0.000 17.654 20.315
neighborhood[T.North Corona] -0.0050 1.982 -0.003 0.998 -3.889 3.879
neighborhood[T.North Side] 2.6954 0.750 3.592 0.000 1.224 4.166
neighborhood[T.Pomonok] -5.9819 6.914 -0.865 0.387 -19.533 7.569
neighborhood[T.Port Morris] -4.0804 2.357 -1.732 0.083 -8.699 0.538
neighborhood[T.Queensboro Hill] -4.4247 2.943 -1.504 0.133 -10.192 1.343
neighborhood[T.Ravenswood] -0.2690 0.724 -0.372 0.710 -1.688 1.150
neighborhood[T.Rego Park] -2.3424 1.381 -1.696 0.090 -5.050 0.365
neighborhood[T.Ridgewood] 2.6841 2.943 0.912 0.362 -3.084 8.453
neighborhood[T.Roosevelt Island] -2.2913 0.743 -3.083 0.002 -3.748 -0.835
neighborhood[T.Soho] 4.7183 0.689 6.846 0.000 3.367 6.069
neighborhood[T.South Corona] -2.4565 2.598 -0.946 0.344 -7.549 2.636
neighborhood[T.South Side] -0.2533 0.785 -0.323 0.747 -1.791 1.285
neighborhood[T.Steinway] 0.0831 2.357 0.035 0.972 -4.536 4.703
neighborhood[T.Stuyvesant Town] -1.3713 0.730 -1.879 0.060 -2.801 0.059
neighborhood[T.Sunnyside Gardens] -0.8231 0.779 -1.057 0.291 -2.350 0.704
neighborhood[T.Sutton Place] 40.9337 0.679 60.302 0.000 39.603 42.264
neighborhood[T.Tribeca] 4.4608 0.691 6.454 0.000 3.106 5.816
neighborhood[T.Tudor City] 5.8923 0.690 8.539 0.000 4.540 7.245
neighborhood[T.Turtle Bay] 19.7735 0.680 29.089 0.000 18.441 21.106
neighborhood[T.Upper East Side] 22.2021 0.686 32.364 0.000 20.857 23.547
neighborhood[T.Upper West Side] 22.7118 0.680 33.383 0.000 21.378 24.045
neighborhood[T.Vinegar Hill] -3.2016 2.317 -1.382 0.167 -7.743 1.340
neighborhood[T.West Village] 16.8112 0.679 24.745 0.000 15.480 18.143
neighborhood[T.Whitestone] 7.7166 11.955 0.645 0.519 -15.717 31.150
neighborhood[T.Williamsburg] 0.8082 1.243 0.650 0.516 -1.629 3.245
neighborhood[T.Woodside] -0.2853 0.846 -0.337 0.736 -1.943 1.372
neighborhood[T.Yorkville] 12.0776 0.683 17.675 0.000 10.738 13.417
pickup_day[T.Monday] -1.7536 0.359 -4.891 0.000 -2.456 -1.051
pickup_day[T.Saturday] -0.4711 0.325 -1.450 0.147 -1.108 0.166
pickup_day[T.Sunday] -2.2211 0.318 -6.992 0.000 -2.844 -1.598
pickup_day[T.Thursday] -0.3807 0.308 -1.237 0.216 -0.984 0.223
pickup_day[T.Tuesday] -2.1409 0.319 -6.713 0.000 -2.766 -1.516
pickup_day[T.Wednesday] -1.0679 0.309 -3.456 0.001 -1.674 -0.462
hour_0 -8.0090 0.911 -8.794 0.000 -9.794 -6.224
hour_1 -11.1870 0.914 -12.240 0.000 -12.978 -9.395
hour_2 -13.8061 0.917 -15.057 0.000 -15.603 -12.009
hour_3 -16.2926 0.924 -17.640 0.000 -18.103 -14.482
hour_4 -18.4563 0.926 -19.936 0.000 -20.271 -16.642
hour_5 -19.3255 0.922 -20.954 0.000 -21.133 -17.518
hour_6 -13.8181 0.917 -15.069 0.000 -15.616 -12.021
hour_7 -7.2344 0.912 -7.934 0.000 -9.022 -5.447
hour_8 -3.2645 0.911 -3.582 0.000 -5.051 -1.478
hour_9 -2.6974 0.912 -2.956 0.003 -4.486 -0.909
hour_10 -4.0746 0.913 -4.464 0.000 -5.864 -2.286
hour_11 -3.2049 0.912 -3.515 0.000 -4.992 -1.418
hour_12 -2.1469 0.912 -2.354 0.019 -3.935 -0.359
hour_13 -1.9966 0.913 -2.188 0.029 -3.785 -0.208
hour_14 -1.4199 0.911 -1.559 0.119 -3.206 0.366
hour_15 -1.9144 0.911 -2.102 0.036 -3.700 -0.129
hour_16 -4.2609 0.912 -4.670 0.000 -6.049 -2.473
hour_17 -1.5655 0.913 -1.714 0.086 -3.355 0.224
hour_18 2.7505 0.909 3.024 0.002 0.968 4.533
hour_19 2.9484 0.909 3.243 0.001 1.166 4.730
hour_20 0.7563 0.910 0.831 0.406 -1.027 2.540
hour_21 0.0762 0.911 0.084 0.933 -1.708 1.861
hour_22 -0.8289 0.909 -0.912 0.362 -2.610 0.952
hour_23 -4.3434 0.908 -4.784 0.000 -6.123 -2.564
Mean_TemperatureF -0.0388 0.050 -0.769 0.442 -0.138 0.060
MeanDew_PointF -0.0141 0.052 -0.271 0.786 -0.116 0.088
Mean_Humidity 0.0972 0.023 4.203 0.000 0.052 0.143
Mean_Sea_Level_PressureIn 4.3596 0.683 6.384 0.000 3.021 5.698
Mean_VisibilityMiles 0.0382 0.086 0.444 0.657 -0.130 0.207
Mean_Wind_SpeedMPH 0.4694 0.086 5.430 0.000 0.300 0.639
Max_Gust_SpeedMPH 0.1238 0.026 4.830 0.000 0.074 0.174
PrecipitationIn -0.1131 0.511 -0.221 0.825 -1.116 0.889
CloudCover 0.1187 0.063 1.897 0.058 -0.004 0.241
Omnibus: 6160.856 Durbin-Watson: 0.511
Prob(Omnibus): 0.000 Jarque-Bera (JB): 50316.037
Skew: 0.842 Prob(JB): 0.00
Kurtosis: 9.373 Cond. No. 6.14e+16

Decision Tree Regressor

Finally, I fit a decision tree regressor, validating it on a held-out test set from a train-test split of the dataset. I search over max_depth values from 1 to 25 to find the optimum depth for the tree; min_samples_split is left at its default of 2.

In [140]:
X = trip_and_weather_df.drop(['full_date','pickup_day','pickup_hour','no_of_trips','Borough','longitude','latitude',\
                              'neighborhood','month_day','EST','Events','Max_TemperatureF','Min_TemperatureF','Max_Dew_PointF',\
                              'Min_DewpointF','Max_Humidity','Min_Humidity','Max_Sea_Level_PressureIn','Min_Sea_Level_PressureIn',\
                              'Max_VisibilityMiles','Min_VisibilityMiles','Max_Wind_SpeedMPH'], axis=1)

y = trip_and_weather_df['no_of_trips']

train_data,test_data,train_op,test_op = train_test_split(X,y,test_size=0.2)
print 'Train Data size: %s' % (train_data.shape,)
print 'Test Data size: %s' % (test_data.shape,)

for max_depth in range(1, 26):
    regressor = DecisionTreeRegressor(criterion='mse', max_depth=max_depth, max_features=None,
               max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, presort=False, random_state=None,
               splitter='best')
    print(regressor)
    regressor.fit(train_data,train_op)
    print 'Train Data regressor score: %f' %(regressor.score(train_data,train_op))

    op = regressor.predict(test_data)

    print 'Mean Squared Error: %f' %sklearn.metrics.mean_squared_error(test_op,op)
    print 'Median Absolute Error: %f' %sklearn.metrics.median_absolute_error(test_op,op)
    print 'Test set R-squared: %f' %sklearn.metrics.r2_score(test_op,op)
Train Data size: (22232, 113)
Test Data size: (5558, 113)
DecisionTreeRegressor(criterion='mse', max_depth=1, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
Train Data regressor score: 0.171595
Mean Squared Error: 355.657680
Median Absolute Error: 12.008088
Test set R-squared: 0.151092
DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
Train Data regressor score: 0.244180
Mean Squared Error: 333.656682
Median Absolute Error: 11.893222
Test set R-squared: 0.203605
DecisionTreeRegressor(criterion='mse', max_depth=3, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
Train Data regressor score: 0.319387
Mean Squared Error: 298.537050
Median Absolute Error: 11.198120
Test set R-squared: 0.287431
DecisionTreeRegressor(criterion='mse', max_depth=4, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
Train Data regressor score: 0.393707
Mean Squared Error: 267.127920
Median Absolute Error: 10.304834
Test set R-squared: 0.362401
DecisionTreeRegressor(criterion='mse', max_depth=5, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
Train Data regressor score: 0.469157
Mean Squared Error: 236.261662
Median Absolute Error: 9.395748
Test set R-squared: 0.436074
DecisionTreeRegressor(criterion='mse', max_depth=6, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
Train Data regressor score: 0.518377
Mean Squared Error: 218.357658
Median Absolute Error: 8.747533
Test set R-squared: 0.478809
DecisionTreeRegressor(criterion='mse', max_depth=7, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
Train Data regressor score: 0.555583
Mean Squared Error: 196.884756
Median Absolute Error: 8.255878
Test set R-squared: 0.530062
DecisionTreeRegressor(criterion='mse', max_depth=8, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
Train Data regressor score: 0.590756
Mean Squared Error: 184.783335
Median Absolute Error: 7.735639
Test set R-squared: 0.558946
DecisionTreeRegressor(criterion='mse', max_depth=9, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
Train Data regressor score: 0.624129
Mean Squared Error: 171.817707
Median Absolute Error: 7.219914
Test set R-squared: 0.589894
DecisionTreeRegressor(criterion='mse', max_depth=10, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
Train Data regressor score: 0.656042
Mean Squared Error: 160.905431
Median Absolute Error: 6.691977
Test set R-squared: 0.615940
DecisionTreeRegressor(criterion='mse', max_depth=11, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
Train Data regressor score: 0.686362
Mean Squared Error: 150.586928
Median Absolute Error: 6.754153
Test set R-squared: 0.640569
DecisionTreeRegressor(criterion='mse', max_depth=12, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
Train Data regressor score: 0.716481
Mean Squared Error: 143.892267
Median Absolute Error: 6.395802
Test set R-squared: 0.656548
DecisionTreeRegressor(criterion='mse', max_depth=13, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
Train Data regressor score: 0.746800
Mean Squared Error: 131.920283
Median Absolute Error: 6.051507
Test set R-squared: 0.685124
DecisionTreeRegressor(criterion='mse', max_depth=14, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
Train Data regressor score: 0.774534
Mean Squared Error: 125.372211
Median Absolute Error: 5.507581
Test set R-squared: 0.700753
DecisionTreeRegressor(criterion='mse', max_depth=15, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
Train Data regressor score: 0.802100
Mean Squared Error: 117.550265
Median Absolute Error: 4.969581
Test set R-squared: 0.719423
DecisionTreeRegressor(criterion='mse', max_depth=16, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
Train Data regressor score: 0.824897
Mean Squared Error: 107.979969
Median Absolute Error: 4.538045
Test set R-squared: 0.742266
DecisionTreeRegressor(criterion='mse', max_depth=17, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
Train Data regressor score: 0.845354
Mean Squared Error: 105.598466
Median Absolute Error: 4.106764
Test set R-squared: 0.747950
DecisionTreeRegressor(criterion='mse', max_depth=18, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
Train Data regressor score: 0.864937
Mean Squared Error: 100.920520
Median Absolute Error: 3.685093
Test set R-squared: 0.759116
DecisionTreeRegressor(criterion='mse', max_depth=19, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
Train Data regressor score: 0.881167
Mean Squared Error: 100.209951
Median Absolute Error: 3.383876
Test set R-squared: 0.760812
DecisionTreeRegressor(criterion='mse', max_depth=20, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
Train Data regressor score: 0.895547
Mean Squared Error: 98.451269
Median Absolute Error: 3.116167
Test set R-squared: 0.765010
DecisionTreeRegressor(criterion='mse', max_depth=21, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
Train Data regressor score: 0.908608
Mean Squared Error: 99.913688
Median Absolute Error: 2.847880
Test set R-squared: 0.761519
DecisionTreeRegressor(criterion='mse', max_depth=22, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
Train Data regressor score: 0.920256
Mean Squared Error: 99.392148
Median Absolute Error: 2.553072
Test set R-squared: 0.762764
DecisionTreeRegressor(criterion='mse', max_depth=23, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
Train Data regressor score: 0.930785
Mean Squared Error: 98.616135
Median Absolute Error: 2.733361
Test set R-squared: 0.764616
DecisionTreeRegressor(criterion='mse', max_depth=24, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
Train Data regressor score: 0.940246
Mean Squared Error: 101.918434
Median Absolute Error: 2.751899
Test set R-squared: 0.756734
DecisionTreeRegressor(criterion='mse', max_depth=25, max_features=None,
           max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
Train Data regressor score: 0.948325
Mean Squared Error: 99.977798
Median Absolute Error: 2.636364
Test set R-squared: 0.761366

A max_depth of 20 gives us the best result on the test dataset.

  • Train Data Regressor score of 0.8955
  • Mean Squared Error of 98.45
  • Median Absolute Error of 3.116
  • R-squared value for test data = 0.765
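The depth sweep logged above can be reproduced with a loop like the following. This is a minimal sketch on synthetic data (the notebook's real feature matrix is built earlier); the metric helpers come from `sklearn.metrics`, and the modern `sklearn.model_selection` import replaces the deprecated `sklearn.cross_validation` used at the top of the notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, median_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real notebook uses the joined trip/weather features.
rng = np.random.RandomState(0)
X = rng.rand(2000, 5)
y = 10 * X[:, 0] + 5 * X[:, 1] ** 2 + rng.randn(2000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
for depth in range(16, 26):
    reg = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_train, y_train)
    pred = reg.predict(X_test)
    results[depth] = (reg.score(X_train, y_train),          # train R^2
                      mean_squared_error(y_test, pred),     # test MSE
                      median_absolute_error(y_test, pred),  # test median abs. error
                      r2_score(y_test, pred))               # test R^2

# Pick the depth with the best test-set R-squared, as done above.
best_depth = max(results, key=lambda d: results[d][3])
```

On the real data this loop selects `max_depth=20`; on other data the selected depth will of course differ.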

Visualization of fare amount and tips on the entire dataset

The following blocks of code analyze the fare amount and tip amount for all taxi trips, i.e. the entire dataset.

First, I look at the distribution of the mean tip amount received by each driver.

In [141]:
%%time

trip_full_df.groupby('medallion')['tip_amount'].mean().hist(bins=100)
Wall time: 22.5 s

One possible reason the graph shows so many drivers receiving low tips is that the data is skewed by the many records with tip_amount = 0. If I remove these records and re-plot the histogram, the result is a more accurate portrayal of the average tip amount received by drivers.

In [143]:
%%time

trip_full_df[trip_full_df.tip_amount != 0].groupby('medallion')['tip_amount'].mean().hist(bins=100)
Wall time: 7.69 s
Out[143]:
<matplotlib.axes._subplots.AxesSubplot at 0x185145710>

I was interested in the relationship between tip amount and parameters such as trip distance and fare amount. I also looked at the relationship between fare amount and trip distance.

For this analysis, I only looked at the first 500,000 records and not the entire dataset.

In [144]:
trip_full_df['trip_distance'] = trip_full_df['trip_distance'].astype(float)
trip_full_df['fare_amount'] = trip_full_df['fare_amount'].astype(float)
trip_full_df['tip_amount'] = trip_full_df['tip_amount'].astype(float)
trip_full_df['total_amount'] = trip_full_df['total_amount'].astype(float)

fare_data_head = trip_full_df.head(500000)

sns.lmplot('trip_distance', 'fare_amount', fare_data_head[fare_data_head.trip_distance > 0],size = 10)
sns.lmplot('trip_distance', 'tip_amount', fare_data_head[fare_data_head.trip_distance > 0],size = 10)
sns.lmplot('fare_amount', 'tip_amount', fare_data_head[fare_data_head.trip_distance > 0],size = 10)
Out[144]:
<seaborn.axisgrid.FacetGrid at 0x1843669e8>

The relationship between tip amount and fare amount shows a linear characteristic with several distinct slopes (probably at 10%, 15%, 20% and so on).

Fare amount as a function of trip distance is also linear, as expected. However, it's interesting to note that several trips have a constant fare amount of $52 irrespective of trip distance. This turns out to be the flat fare charged for trips between Manhattan and JFK Airport.
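The flat-fare trips can be isolated and inspected with a simple filter. This is a sketch on a hypothetical mini-frame that reuses the column names of trip_full_df; the values are illustrative, not from the dataset.

```python
import pandas as pd

# Hypothetical mini-frame with the same column names as trip_full_df.
trips = pd.DataFrame({
    'fare_amount':   [52.0, 52.0, 14.5, 8.0, 52.0],
    'trip_distance': [17.8, 16.2, 4.1, 1.2, 18.5],
})

# Select the constant-fare trips and look at their distance distribution.
flat_fare = trips[trips.fare_amount == 52.0]
print(flat_fare.trip_distance.describe())
```

On the real data, the flat-fare trips cluster at long distances, consistent with a fixed airport rate.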

I then looked at the distribution of the median tip fraction received by individual drivers, grouping by medallion number.

Next, I look at the overall distribution of the tip fraction, considering only records whose tip fraction lies between 0 and 0.5.

In [145]:
%%time

trip_full_df['tip_fraction'] = trip_full_df['tip_amount']/trip_full_df['fare_amount']
trip_full_df.groupby('medallion')['tip_fraction'].median().hist(bins=250)
Wall time: 4.69 s
In [146]:
%%time

trip_full_df[(trip_full_df['tip_fraction'] > 0) & (trip_full_df['tip_fraction'] < 0.5)]['tip_fraction'].hist(bins=250)
Wall time: 41.9 s
Out[146]:
<matplotlib.axes._subplots.AxesSubplot at 0x2d9531fd0>

There is a distinct spike at 0.2, i.e. a 20% tip. Also, tip fractions above 0.2 are more common than those below.

Finally, I wrap up by looking at the distribution of the log of the tips, since the tip amount distribution was seen to be tail-heavy.

In [147]:
%%time

trip_full_df['tip_amount_log'] = trip_full_df['tip_amount'].apply(np.log10)

trip_full_df[trip_full_df['tip_amount_log'] > 0]['tip_amount_log'].hist(bins=100)
Wall time: 6min 46s

Visualizing Pickups and Dropoffs for January 2013 at different times of the day

In [148]:
from bokeh.plotting import figure, output_notebook, show
import datashader as ds

I load only the pickup/dropoff coordinates, timestamps, and passenger counts of the full NYC taxi dataset into a new dataframe. I first plot histograms of the latitudes and longitudes of the pickup locations to see where most trips take place. I then use the Datashader library to visualize regions where, in aggregate, the number of pickups exceeds the number of dropoffs, and vice versa.

In [149]:
%%time

df = pd.read_csv('../nyc_taxi_files/trip_data_files/trip_data_1.csv', usecols= \
                       ['pickup_longitude', 'pickup_latitude', 'pickup_datetime', 'dropoff_datetime', 'dropoff_longitude','dropoff_latitude', 'passenger_count'])
df.tail()
Out[149]:
pickup_datetime dropoff_datetime passenger_count pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude
14776610 2013-01-06 04:58:23 2013-01-06 05:11:24 1 -73.989029 40.759327 -73.953743 40.770672
14776611 2013-01-08 14:42:04 2013-01-08 14:50:27 1 -73.993042 40.733990 -73.982483 40.724823
14776612 2013-01-10 13:29:23 2013-01-10 13:34:45 1 -73.979553 40.785011 -73.968262 40.788158
14776613 2013-01-06 16:30:15 2013-01-06 16:42:26 1 -73.968002 40.762161 -73.985992 40.770542
14776614 2013-01-05 20:38:46 2013-01-05 20:43:06 1 -73.982224 40.766670 -73.989212 40.773636
In [150]:
df['pickup_hour']=df['pickup_datetime'].apply(lambda x: time.strptime(x, "%Y-%m-%d %H:%M:%S").tm_hour)
df['dropoff_hour']=df['dropoff_datetime'].apply(lambda x: time.strptime(x, "%Y-%m-%d %H:%M:%S").tm_hour)
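The per-row strptime calls above are slow on ~14.7 million records. A vectorized parse with pd.to_datetime produces the same hour columns in a single call; a sketch on a two-row frame:

```python
import pandas as pd

df = pd.DataFrame({'pickup_datetime': ['2013-01-06 04:58:23', '2013-01-08 14:42:04']})

# Vectorized parse: one pd.to_datetime call instead of one strptime per row.
df['pickup_hour'] = pd.to_datetime(df['pickup_datetime'], format='%Y-%m-%d %H:%M:%S').dt.hour
```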
In [151]:
df.tail()
Out[151]:
pickup_datetime dropoff_datetime passenger_count pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude pickup_hour dropoff_hour
14776610 2013-01-06 04:58:23 2013-01-06 05:11:24 1 -73.989029 40.759327 -73.953743 40.770672 4 5
14776611 2013-01-08 14:42:04 2013-01-08 14:50:27 1 -73.993042 40.733990 -73.982483 40.724823 14 14
14776612 2013-01-10 13:29:23 2013-01-10 13:34:45 1 -73.979553 40.785011 -73.968262 40.788158 13 13
14776613 2013-01-06 16:30:15 2013-01-06 16:42:26 1 -73.968002 40.762161 -73.985992 40.770542 16 16
14776614 2013-01-05 20:38:46 2013-01-05 20:43:06 1 -73.982224 40.766670 -73.989212 40.773636 20 20
In [152]:
df.describe()
Out[152]:
passenger_count pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude pickup_hour dropoff_hour
count 1.477662e+07 1.477662e+07 1.477662e+07 1.477653e+07 1.477653e+07 1.477662e+07 1.477662e+07
mean 1.697372e+00 -7.263634e+01 4.001440e+01 -7.259443e+01 3.999219e+01 1.358165e+01 1.358404e+01
std 1.365396e+00 1.013819e+01 7.789904e+00 1.028860e+01 7.537067e+00 6.401632e+00 6.461708e+00
min 0.000000e+00 -2.771285e+03 -3.547921e+03 -2.350956e+03 -3.547921e+03 0.000000e+00 0.000000e+00
25% 1.000000e+00 -7.399188e+01 4.073551e+01 NaN NaN 9.000000e+00 9.000000e+00
50% 1.000000e+00 -7.398166e+01 4.075315e+01 NaN NaN 1.400000e+01 1.400000e+01
75% 2.000000e+00 -7.396684e+01 4.076729e+01 NaN NaN 1.900000e+01 1.900000e+01
max 2.550000e+02 1.124042e+02 3.310365e+03 2.228738e+03 3.477106e+03 2.300000e+01 2.300000e+01
In [153]:
print df[(df['pickup_longitude']>-74.1)&(df['pickup_longitude']<-73.8)]['pickup_longitude'].median()
df[(df['pickup_longitude']>-74.1)&(df['pickup_longitude']<-73.8)]['pickup_longitude'].hist(bins=50)
-73.98217
Out[153]:
<matplotlib.axes._subplots.AxesSubplot at 0x1848b6cc0>
In [154]:
print df[(df['pickup_latitude']>40.6)&(df['pickup_latitude']<40.9)]['pickup_latitude'].median()
df[(df['pickup_latitude']>40.5)&(df['pickup_latitude']<40.9)]['pickup_latitude'].hist(bins=50)
40.753811
Out[154]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ccc0dac8>

I define a simple Bokeh plot below. Datashader, which will be used for the visualization, builds on top of Bokeh.

In [155]:
output_notebook()

x_range=(-74.05,-73.8)
y_range=(40.7,40.8)

def base_plot(tools='pan,wheel_zoom,reset',plot_width=900, plot_height=600, **plot_args):
    p = figure(tools=tools, plot_width=plot_width, plot_height=plot_height,
        x_range=x_range, y_range=y_range, outline_line_color=None,**plot_args)
    
    p.axis.visible = False
    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = None
    return p
    
options = dict(line_color=None, fill_color='blue', size=5)

from IPython.core.display import HTML, display
display(HTML("<style>.container { width:90% !important; }</style>"))
Loading BokehJS ...
In [156]:
import datashader as ds
from datashader import transfer_functions as tf
from datashader.colors import Greys9
Greys9_r = list(reversed(Greys9))[:-2]
In [157]:
def draw_pickup_dropoff_maps(hour_start, hour_end):

    '''
    Creates a datashader image of pickup and dropoff densities.
    Args:
        hour_start (int): 0 to 23. Trips whose pickup (or dropoff) hour is >= hour_start are considered.
        hour_end (int): 1 to 24, exclusive. Trips whose pickup (or dropoff) hour is < hour_end are considered.

    Returns:
        img (stacked image): datashader image interpolating the total number of pickups and dropoffs.
    '''
    
    cvs = ds.Canvas(plot_width=900, plot_height=600, x_range=x_range, y_range=y_range)

    # Pickup map
    pickup_df = df[(df['pickup_hour'] >= hour_start) & (df['pickup_hour']<hour_end)]
    agg1 = cvs.points(pickup_df, 'pickup_longitude', 'pickup_latitude')

    # Dropoff map
    dropoff_df = df[(df['dropoff_hour'] >= hour_start) & (df['dropoff_hour']<hour_end)]
    agg2 = cvs.points(dropoff_df, 'dropoff_longitude', 'dropoff_latitude')

    img1=tf.interpolate(agg1.where(agg1>agg2), cmap=["lightblue", 'blue'], how='eq_hist')
    img2=tf.interpolate(agg2.where(agg2>agg1), cmap=["lightpink", 'red'], how='eq_hist')

    img = tf.stack(img1,img2)
#     return tf.dynspread(img, threshold=0.1, max_px=4)
    return img

Using the method above, I visualize how the nature of taxi trips changes over the course of a day. Regions with more pickups than dropoffs are marked blue, while regions with more dropoffs than pickups are marked red.

Shown below are 12 visualizations of pickup and dropoff trends for 2-hour time periods, starting with trips between 12 AM and 2 AM and ending with trips between 10 PM and 12 AM.

In [158]:
img=draw_pickup_dropoff_maps(0, 2)
img
Out[158]:
In [159]:
img=draw_pickup_dropoff_maps(2, 4)
img
Out[159]:
In [160]:
img=draw_pickup_dropoff_maps(4, 6)
img
Out[160]:
In [161]:
img=draw_pickup_dropoff_maps(6, 8)
img
Out[161]:
In [162]:
img=draw_pickup_dropoff_maps(8, 10)
img
Out[162]:
In [163]:
img=draw_pickup_dropoff_maps(10, 12)
img
Out[163]:
In [164]:
img=draw_pickup_dropoff_maps(12, 14)
img
Out[164]:
In [165]:
img=draw_pickup_dropoff_maps(14, 16)
img
Out[165]:
In [166]:
img=draw_pickup_dropoff_maps(16, 18)
img
Out[166]:
In [167]:
img=draw_pickup_dropoff_maps(18, 20)
img
Out[167]:
In [168]:
img=draw_pickup_dropoff_maps(20, 22)
img
Out[168]:
In [169]:
img=draw_pickup_dropoff_maps(22, 24)
img
Out[169]:

Project Summary

  • NYC yellow taxis have been losing trips to Uber, with fewer trips in both 2014 and 2015 compared to the previous year.
  • The most taxi trips take place between 6 and 8 PM, i.e. after work hours.
  • A sample subset of the overall dataset was used for modeling the trip counts.
  • Trip counts were calculated using groupby function on pickup_hour, neighborhood, and date.
  • Linear and random forest regression models using both statsmodels and scikit-learn were used for modeling.
  • NYC neighborhoods and weather data were joined to the trip data in order to create the overall dataset.
  • Using a decision tree regressor with a depth of 20 gave us the best fit during cross-validation.
  • Tip amounts and tip fractions received by drivers were analyzed. The tip fraction distribution showed a peak at 20%.
  • Using Datashader, pickup and dropoff trends throughout the day were visualized, showing a clear change in the densities of pickups and dropoffs as the day progresses.
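The trip-count aggregation mentioned in the summary can be sketched as follows; the mini-frame and its column names are illustrative stand-ins for the joined dataset.

```python
import pandas as pd

# Hypothetical mini-frame standing in for the joined trip dataset.
trips = pd.DataFrame({
    'date':         ['2013-01-01'] * 3 + ['2013-01-02'],
    'pickup_hour':  [8, 8, 9, 8],
    'neighborhood': ['Midtown', 'Midtown', 'Midtown', 'SoHo'],
})

# One row per (date, hour, neighborhood) with the number of trips in that bucket.
counts = (trips.groupby(['date', 'pickup_hour', 'neighborhood'])
               .size()
               .reset_index(name='trip_count'))
```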